Thursday, October 28, 2004
MT book of the event
For 45 €, you can buy Machine Translation: From Real Users to Research, the book of the 6th Conference of the Association for Machine Translation in the Americas held in early October, 2004.
The 30 revised papers presented were carefully reviewed and selected for inclusion in the book. The papers address all current issues in machine translation ranging from theoretical and methodological topics to applications in various contexts and evaluation and analysis of user needs and systems.
Translation Technology • (0) Comments • (1) Trackbacks
XLIFF explained
Rodolfo M. Raya has recently published a second article in his series on XML for localization.
This second part focuses on XML Localisation Interchange File Format (XLIFF) and explains with practical examples how to use it for translating different kinds of documents. This article presents a step-by-step guide to translating multilingual documents using XLIFF as an intermediary file format, and provides useful resources for localizing Java applications.
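To give a feel for what an XLIFF file looks like in practice, here is a minimal sketch (not taken from Raya's article) that reads a tiny XLIFF 1.1 document with Python's standard library and lists the translation units that still lack a target. The sample content, the file name messages.properties and the function name untranslated_units are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.1 documents.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.1"}

# Hypothetical sample: two strings from a Java properties file,
# only one of which has been translated so far.
SAMPLE = """<xliff version="1.1" xmlns="urn:oasis:names:tc:xliff:document:1.1">
  <file original="messages.properties" source-language="en"
        target-language="fr" datatype="javapropertyresourcebundle">
    <body>
      <trans-unit id="greeting">
        <source>Hello</source>
        <target>Bonjour</target>
      </trans-unit>
      <trans-unit id="farewell">
        <source>Goodbye</source>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def untranslated_units(xliff_text):
    """Return (id, source text) for every trans-unit with no usable target."""
    root = ET.fromstring(xliff_text)
    pending = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        target = unit.find("x:target", NS)
        # A missing or empty <target> means the unit still needs translating.
        if target is None or not (target.text or "").strip():
            source = unit.findtext("x:source", default="", namespaces=NS)
            pending.append((unit.get("id"), source))
    return pending


print(untranslated_units(SAMPLE))
```

The point of the intermediary format is visible even in this toy: the translator's tool only ever deals with source/target pairs, whatever the original document type was.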
The first article briefly explained the most relevant XML standards used in the localization industry. By the way, the s/z shift in spelling localized is intentional – and a good example of spelling localis/zation.
Translation Technology • (0) Comments • (0) Trackbacks
Wednesday, October 27, 2004
Spleak speaking
If you use MSN Messenger and like playing English language games with chatterbots, try the beta version of Spleak, a female avatar produced by IMT Labs, a Danish start-up dedicated to enhancing the web interface experience. You can download Spleak for free and then yak away.
The underlying idea is that Spleak will act as an interface to information and eventually other services over and from the web. For this beta version, you can ask for word definitions, find out facts on U.S. presidents (!) and even ask for a Shakespeare play to read. Spleak naturally has Hamlet, Prince of Denmark, which then unfurls a dozen or so lines at a time in your Messaging window. Well, it’s a start. This beta is designed to help IMT Labs build up a corpus of questions, language usage etc. to expand the knowledge base and enrich the user experience. But it is not clear whether the program actually uses your query to consult existing web resources.
If you start trying to chat Spleak up, she does pretty well at evading the issue. If you treat her like ELIZA, the mythic shrinkbot program built by Joseph Weizenbaum in 1966, and tell Spleak you have a problem, she’ll probably reply “Whatever.” And when I asked why she’s called Spleak, she offered me a spell checker. As Hamlet would have said, Glet thlee to a nlunnery!
Language Industry News and Events • (0) Comments • (0) Trackbacks
Tuesday, October 26, 2004
The virtues of Simple English Wikipedia
One of the more endearing aspects of the Wikipedia (online open source inspired encyclopedia) project is its decision to include Simple English among the localized versions of various Wikipedia articles. Indeed, the main page of the Simple English section is itself written in, err… Simple English. It is targeted at English learners and teachers, a little like the Collins Cobuild dictionary for learners which used ‘simple’ English in the definitions. The Simple English versions are also designed to help the many translators working on English originals.
Wikipedia has come in for considerable shtick from professional librarians and the like since, as a grass-roots open publishing enterprise generated by unpaid amateurs, it is open to abuse of all kinds. But as this recent article suggests, the very scope and reactivity of the production community ensure that errors and idiocies are removed faster than you can say Britannica.
What intrigues me, though, is that Wikipedia offers an extraordinary resource for people interested in the writing process, as the whole history of modifications is stored in the history section of any given article. Unlike most documents, whether digital or ex-arboreous, you can inspect the way Wikipedia’s articles have been edited, expanded, rewritten for readability and so on. And then you can go to a translated version of the article and see the same process of editing applied to the new version. A rare treasure after five hundred years of final-version-only print publishing.
Which is where the Simple English Wiki comes in. First, it shows that writers can more or less easily ‘simplify’ their discourse, even without much training or theory. Second, the Simple English discussion forum offers an instructive glimpse into how various aspects of simple writing are thrashed out by real-world practitioners. And third, the actual encyclopedia entries together with their editing histories offer a superb training resource for people interested in the potential for using ‘simple’ expressive means (sentence length, vocabulary, grammatical rules, etc.) in expository discourse.
Localization Culture • (0) Comments • (2) Trackbacks
Must-have translation and localization resources
In the wake of the great native speaker quality English machine translation system, how about a map of the London underground fully localized into German? Mind the gap at Nimdill Zirkus!
Localization Culture • (0) Comments • (1) Trackbacks
Monday, October 25, 2004
Language police strike again
UK schools are to be given official guidance (a 60-page booklet called Introducing the Grammar of Talk) on how to teach pupils standard spoken English. The Qualifications and Curriculum Authority says teaching about speaking is more difficult for many teachers than teaching writing skills, for which there is a well-developed system of grammar. It hopes the initiative will improve pupils’ social skills and career prospects.
Meanwhile in Poland, there is outrage that Wojciech Pomorski has been told by the Hamburg (Germany) city authorities that regular dialogue in Polish with his daughters, aged seven and four, would hinder their integration into German society. He is separated from his wife and has been denied access to their daughters after refusing to guarantee that he would speak to them only in German.
Language Industry News and Events • (0) Comments • (1) Trackbacks
New web terminology interface
Babeling, a French terminology engineering software developer, has just released Motilus, an interface to multi-database term queries over the web. Or as Babeling puts it:
Motilus is a unique interrogation engine software of monolingual and multilingual terminology resources and dictionaries available on the Internet. Its innovation is its ability to select and query, simultaneously, relevant dictionaries and terminology resources in a particular language combination in seconds. With both single-user or network (intranet, extranet) applications, Motilus’ high processing speed and user-friendly, attractive design set it apart.
Motilus’ originality is to spare the user from painstaking research and repeated querying of the lexicographical and terminological resources, by selecting, for each resource, the source language, the target language(s), and other possible criteria as well as typing in the searched for word. With Motilus, the user selects once: the source language, target language, and types in the word. With one click he/she will obtain a comprehensive list of resources containing the word. On clicking again, the content of the chosen resource will be brought up: the words found as a result of the search, their definitions, their translations etc.
Was the company aware of an English near-homonym of its corporate name? The product sounds useful, but I found it hard to get my head around the website blurb and services.
Translation Technology • (0) Comments • (3) Trackbacks
MT = mobile translation in Singapore
GistXL, an English (or Singlish) to Simplified Chinese SMS translation platform embedded in SingTel’s Singapore network, won a Merit Award at the Singapore National Infocomm Awards held last week. According to The Straits Times:
GistXL lays claim to be one of the first (free translation systems) in the world to build artificial intelligence into the system. This means, while other “rule-based” systems match an English word to a Chinese one, GistXL sieves through its database – within a fifth of a second – to draw links and contexts between words and phrases.
To be picky, perhaps they meant “draw links between words and their context” or “contextualize words in phrases” rather than actually “draw contexts”. And it’s a pity to trot out the old AI saw again. The system is presumably what MT folks call ‘data driven’. According to their blurb, GistXL plans to scale up to 18 languages by the end of 2005.
Clearly, the stripped-down expressive forms (aka controlled language) found in text messaging, together with the platform constraints on volume (GistXL will do around 25 words per translation), are proving to be a fruitful field for commercial language engineers. It would be interesting to know in detail how standardized actual SMS messages are becoming in the various linguistic sub-groups that use them. GistXL, for example, is focused on the specifics of Singlish (Singapore English), as filtered through the space and typing constraints of SMS. As it happens, Singaporeans can be pretty good at inputting ‘standard’ English too. In June this year, a student called Kimberly Yeo broke the world record by typing 26 words into his cell phone in 43.24 seconds. At that rate, he could send 83 messages (or 2,165 words) an hour. A golden egg for his telecoms supplier.
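For the curious, the hourly figures follow from simple arithmetic on the record time. This is only a back-of-the-envelope sketch; the variable names are mine, and it assumes the record pace could be sustained for a full hour with no pause between messages.

```python
WORDS_PER_MESSAGE = 26        # length of the record-setting message
SECONDS_PER_MESSAGE = 43.24   # record time quoted above

# Messages per hour at a sustained record pace (fractional value).
messages_per_hour = 3600 / SECONDS_PER_MESSAGE
# Words per hour at the same pace, counting the partly typed last message.
words_per_hour = messages_per_hour * WORDS_PER_MESSAGE

print(int(messages_per_hour))  # whole messages completed in the hour
print(round(words_per_hour))   # words typed in the hour, rounded
```

The two printed values match the 83 messages and 2,165 words quoted above; the word count includes the fraction of an 84th message.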
Translation Technology • (0) Comments • (0) Trackbacks
Friday, October 22, 2004
It's all just metadata
A think piece on why all data is becoming metadata, from David Weinberger:
There used to be a difference between data and metadata. Data was the suitcase and metadata was the name tag on it. Data was the folder and metadata was its label. Data was the contents of the book and metadata was the Dewey Decimal number on its spine. But, in the Third Age of Order (see the previous issue), everything is becoming metadata.
For example, imagine you’re at a large corporation doing a Third Order treatment of its digital library of research articles. Instead of (or, in addition to) designing a large, complex, hierarchical taxonomy, you focus on adding enough metadata to each article so that people will be able to sort and classify them any which way they want. If someone wants to find all the articles that talk about hydrocarbons written in Italian in 1965 and that have more than 30 footnotes, they’ll be able to. If someone wants to make a browsable hierarchy based not on topic but on gender or on the number of co-authors, they’ll be able to. You build enriched objects first so your users can forever after taxonomize the way they want to, instead of the way you think they’ll want to.
Now take a closer look at these information objects. They look like contents tagged with lots of metadata, but in fact they’re all metadata. If I’m looking for an article about hydrocarbons written by Barbara Rodriguez, then the article’s topic ("hydrocarbons") and author’s name ("Rodriguez, Barbara") are metadata, and the content is the data. But, I could just as well be trying to remember the name of the author who wrote an article that included the phrase “Hydrocarbons are the burros of the cosmos” sometime in the 1960s, in which case the content and date are metadata and the author’s name is the data. What’s data and what’s metadata depends on the person doing the asking.
So, in the Third Age of Order, all data is metadata. Contents are labels. Data is all surface and no insides. It’s all handles and no suitcase. It’s a folder whose content is just another label. It’s all sticker and no bumper.
Why does this matter? It changes the primary job of information architects. It makes stores of information more useful to users. It enables research that otherwise would be difficult, thus making our culture smarter overall. But, most interestingly (at least to me), this does the ol’ Einsteinian reverse flip to Aristotle. Aristotle assumed that of the 10 categories by which one could understand a thing, one must be primary: Where that thing fits into the tree of knowledge. So, you could say that Alcibiades is made of flesh or lived in Greece, but if you really want to understand him, you have to say that he is an animal of a particular kind. But, now that everything is metadata, no particular way of understanding something is any more inherently valuable than any other; it all depends on what you’re trying to do. The old framework of knowledge — and authority — are getting a pretty good shake.
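Weinberger’s point that any field can serve as the handle and any other as the thing retrieved is easy to see in a few lines of code. What follows is a minimal sketch, not anything from his piece: the records, the field names and the find helper are invented for illustration, with only the Rodriguez and hydrocarbons details borrowed from his example.

```python
# Hypothetical article records: every field, the text included, is queryable.
ARTICLES = [
    {"author": "Rodriguez, Barbara", "topic": "hydrocarbons",
     "language": "Italian", "year": 1965, "footnotes": 34,
     "text": "Hydrocarbons are the burros of the cosmos."},
    {"author": "Smith, Alan", "topic": "hydrocarbons",
     "language": "English", "year": 1972, "footnotes": 12,
     "text": "A survey of refinery processes."},
    {"author": "Rossi, Carla", "topic": "polymers",
     "language": "Italian", "year": 1965, "footnotes": 41,
     "text": "Notes on chain length."},
]


def find(records, **criteria):
    """Return the records that satisfy every criterion.

    A criterion is either a value to match exactly or a predicate
    (a function of the field's value that returns True or False).
    """
    def holds(record, field, wanted):
        value = record.get(field)
        return wanted(value) if callable(wanted) else value == wanted

    return [r for r in records
            if all(holds(r, f, w) for f, w in criteria.items())]


# Topic, language, year and footnote count act as metadata; the article is the data.
hits = find(ARTICLES, topic="hydrocarbons", language="Italian",
            year=1965, footnotes=lambda n: n > 30)
print([r["author"] for r in hits])

# Now the content and the decade are the metadata, and the author is the data.
hits = find(ARTICLES, text=lambda t: "burros of the cosmos" in t,
            year=lambda y: 1960 <= y <= 1969)
print([r["author"] for r in hits])
```

Both queries run through the same function; nothing in the data model marks one field as content and another as label.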
Personalization and Design • (0) Comments • (0) Trackbacks
Lex dura
The Japanese justice system is undergoing wide-ranging reform. The Office for Promotion of Justice System Reform working within the Cabinet has just today expressed an “urgent need for an efficient and systematic mechanism to translate Japanese legislation into foreign languages in response to calls from lawyers and businesspeople.” The communiqué goes on:
Despite Japan being the world’s second-largest economy, the government has no unified system (sic) to translate Japanese laws into English and other languages. Some major laws have been translated by related ministries or the private sector, but the quality of the translation is often poor and terminology varies.
The group under the Office for Promotion of Justice System Reform of the Prime Minister’s Office suggests that government-led translation of the laws is necessary to facilitate international trade and promote foreign investment in Japan, as well as to assist foreign residents in the country.
“It has long been a big problem that Japanese laws have not been translated coherently and systematically,” said Noboru Kashiwagi, a professor of law at Tokyo’s Chuo University and a member of the group. “Today marks the first step and it is very meaningful.”
The proposal also advises utilizing a computerized system developed by Nagoya University, which includes an electronic dictionary and a database of past translations. But the group has yet to reveal an estimated cost for the project.
It remains undecided whether the task should be taken up solely by the government, to be commissioned to the private sector, or to be done by establishing a new quasi-governmental organization. But the group is envisioning a public-private collaboration.
Now, would it be worthwhile suggesting to Japan that they check on how the European Union has handled a similar problem – the translation of Community law into local Member State languages as part of membership commitment? Probably not. As far as I know, there is no ‘systematic mechanism’ or ‘unified system’ in place to help EU translators through the immense task of localizing the canonical 85,000 pages of EU law. In the past, though, this very process has largely contributed to the development of a translation ‘culture’ in countries such as Denmark, Sweden, Finland and Greece, and has usually been handled via government contracts to legal translators, working in groups.
Since this translation work is ruled by the subsidiarity principle, it is up to each country (rather than a central agency) to get the job done. The exceptional nature of the EU situation lies precisely in having a Member State translate and integrate EU legislation into its own national law and language. Almost by definition, nations normally draft their own laws, rather than localize someone else’s. And here lies the rub: the process involves translators, lawyers, revisers, and anyone else working to ensure that the resulting texts are legally watertight, and optimally adapted to the local context. A hard process to automate as a unified system, but worth attempting surely. EU legal translation is, if you like, inward bound, exclusively for the use of the new Member State.
In the case of Japan, however, it looks as if the translation is outward bound to the world at large. Yet presumably the proposed ‘English’ translation of Japanese law has to use English terms from somewhere – the U.S., Canada, the U.K., Australia etc. – and is therefore implicitly localizing to a legal culture, if not a specific country. For an example of the translation of the Arbitration Law, see here.
The patent lack of systematic mechanisms or unified systems for localizing EU law was brought home to me the other day by a report in Eurolang, the news site for Lesser Used Languages in Europe, which proudly stated that Slovenia, one of the newest members of the EU, has developed some ‘computer-based translation tools’ to help in its EU translation effort. What Slovenia has actually done is to develop “a collection of 62,000 technical-term equivalents, called Evroterm (using Trados’ Multiterm), and a bilingual (English-Slovenian) aligned corpus of translations of legal acts by the EU, called Evrokorpus (currently 8.7 million words).” You can access them at the SVEZ website. But I would call these ‘digital resources’ rather than tools.
What is astonishing to me is that this should be news. Why is it that after a decade of European Commission language technology R&D programs, some with multi-million euro funding, we have no better technology to handle these fundamental translation tasks than a term management system first built in the late 1980s and standard corpus (or translation memory) software? I’m sure that the experience of other Member State translation efforts has propagated some form of best practice for localizing Lex europae. I am equally sure that the reason there is no rich corpus of bilingual translations in the newer EU countries to feed translation memories and help automate some of the slog is that national legal content has typically not needed to be translated.
But looking back, it seems a singular error for EC technology czars to have failed to fund innovative projects (even if they proved inadequate) for this critical EU task. They might just have come up with a more “unified system” (e.g. a set of legal ontologies to drive a translation engine) to help new Members get this job done quicker, better, cheaper. And they might have been able to promote it to the Japanese to help with their legal translation projects. Come to think of it, the Japanese too sank trillions of yen in 5th Generation computing, producing electronic dictionaries, machine translation systems, knowledge bases and so on. All to no apparent avail. Lex dura indeed, sed lex - and highly lexical with it.
Localization Culture • (0) Comments • (0) Trackbacks
Cool typing trivia
Excessive verbiage typed into a chat room or IMS could suggest you’re lying. A report from Cornell University has found that monologuing liars
“talk too much, use more pronouns about others and use more terms about the senses, such as “see,” “hear” and “feel,” than people telling the truth, according to a new study by Cornell communication experts.
“Our study suggests that people who are lying to another person in a chat room or in instant messaging use approximately one-third more words, probably in their attempt to construct a more cohesive and detailed story in order to seem believable,” said Jeff Hancock, assistant professor of communication in the College of Agriculture and Life Sciences (CALS).”
On the other hand, if the ambient temperature isn’t warm enough in your office, you tnde to ùmake more typring mishtakes. Simson Garfinkel blogs yet another Cornell report :
“When the office temperature in a month-long study increased from 68 to 77 degrees Fahrenheit, typing errors fell by 44 percent and typing output jumped 150 percent. Hedge’s study was exploring the link between changes in the physical environment and work performance.
“The results of our study also suggest raising the temperature to a more comfortable thermal zone saves employers about $2 per worker, per hour,” says Hedge, who presented his findings this summer at the 2004 Eastern Ergonomics Conference and Exposition in New York City.
In other words, a chatroom liar in a cold room will type more and with more errors, than their truth-telling counterpart in a warm room. True or flase?
Language in Business • (0) Comments • (0) Trackbacks
Thursday, October 21, 2004
The Iliad a public hit in Italy
Alessandro Baricco, the Italian translator of Homer, has outperformed The Da Vinci Code in sales in Italy with his new Italian version of the Iliad, published by Feltrinelli.
Enthusiasm for the book has led to public readings, starting with 2,600 people listening for 12 hours a night for three nights to the spoken translation in Rome. On October 8, the book’s 20,000 or so lines were read in relays by people from different backgrounds in Verona for 24 hours. The next step in this mass vote for the old bard and his martial themes could be to memorize the whole thing.
It wouldn’t be the first time someone has tried, though. Although it’s hard to find out much about it, a retired U.S. businessman called Steven V.N. Powelson decided to memorize the whole Iliad, and was able to perform vast chunks of it around 1994. This meant getting your personal Mnemosyne to get its aural archive around some 200,000 spoken syllables. Powelson took 16 years, working about an hour a day (making a total of 5,840 hours) as befits a 76-year-old. He apparently used to recite it as part of a national tour to promote the classics, but I cannot discover whether he ever stood up like Italy’s war-scarred public (Iraq not Troy) and performed the whole thing from beginning to end. For a weird commentary on the ‘autoerotics’ of memorizing poems, see this.
Here’s how Baricco summarized his motivations:
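Quando racconto questo lavoro, spesso la gente mi chiede: perché proprio l’Iliade? Alcuni vorrebbero l’Odissea (che io non amo, tranne il finale), o magari Dante o Ariosto. Ho due risposte: la prima è che l’Iliade mi sembra una storia bellissima. La seconda è che godere del racconto di una guerra mi sembra una cura efficace per allontanare il desiderio (tragico ma legittimo) di godere facendo la guerra.
[When I talk about this work, people often ask me: why the Iliad, of all things? Some would prefer the Odyssey (which I do not love, except for the ending), or perhaps Dante or Ariosto. I have two answers: the first is that the Iliad seems to me a beautiful story. The second is that taking pleasure in the telling of a war seems to me an effective cure for warding off the desire (tragic but legitimate) to take pleasure in waging war.]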
Localization Culture • (0) Comments • (0) Trackbacks
