TRANSLATION IN THE 21ST CENTURY
Laurie Gerber started out as a Japanese dictionary “coder” at Systran in late 1986. Fast forward twenty-five years and she is a senior industry figure, advising government and industry on major machine translation research and deployments. TAUS consultant Colin Brace caught up with Laurie to get her take on the progress made with machine translation (MT) technology in the last 25 years and the challenges that remain. We also asked Laurie to provide a detailed critique of issues with newer data-driven approaches to MT, such as those used by Asia Online, Google, IBM, Microsoft and SDL Language Weaver, amongst others.
What’s your general impression of the progress that has been made in the MT space since you started out?
MT has gone from being a somewhat eccentric pursuit to being practically commonplace and mainstream. In the 1990s when Systran had a customer who wanted Systran and Trados integrated, it was a big effort to make it happen. In 2002-2003 when I was exploring ways to get Language Weaver into translation workflows, I approached a wide variety of TM developers. Trados had not moved beyond its old Systran integration. And all the others said that there was no customer demand for MT integration. Now, lo and behold, “hooks” for MT APIs are becoming common, if not de rigueur, for TM products. While most offer real-time integration with free online MT, several also offer easy integration with other locally installed or remotely hosted MT systems.
Keep in mind, though, that much of the adoption and uptake of MT is not in high value/high profile translation but in lower perceived value content – especially that associated with business cost centers such as customer support, where there is constant pressure to cut costs.
Medium-to-high quality, out-of-the-box MT for news or other edited, non-specialized text between major commercial European languages and English is now a reality. Unfortunately, established players are still struggling to differentiate themselves based on this capability.
What are the main technical challenges that remain largely unresolved?
High-quality output is rarely achieved when working from linguistically underspecified languages (Chinese, English) to highly specified (morphologically/grammatically rich/complex) languages such as Arabic. Going from a grammatically rich language to a grammatically less-specified language is easy – information can be lost without harming the translation. But going to grammatically richer languages requires the translation system to infer or perhaps manufacture information that is not present in the source language. Translation between languages with very different basic word order is also a problem.
At a more practical level, best practices in content creation and publishing haven’t been widely embraced. These include using controlled language if possible, being rigorous in the use of standard terminology in authoring and in applying it when translating, customizing MT for each subject area, and managing the quality of TM carefully. Alas, few organizations have the stamina to fully adopt these, or to sustain them over the long run.
There’s a lot of heated debate about the benefits of newer data-driven approaches to MT versus traditional, purely rule-based methods. What are the issues with these newer approaches?
Next to automated training and customization, improved fluency and readability of MT output is the most important contribution of SMT. Rule-based MT has the demerit that even 100% accurate and correct translations can be unreadable because they are stilted and unnatural. The fact that someone who is trying to gather information or to postedit will keep trying longer to work with MT output is a big success. But that fluency can hide structural errors.
SMT and statistical postediting applied to rule-based engines have introduced the ability to learn from user feedback in the form of training systems on user-provided data. That’s a big plus. But the training process often requires too much data, especially since people often want to use MT to go into new areas and start offering translation of material that they have not been translating in the past. SMT also learns surprising and undesirable things from data, yet SMT does not yet enable targeted correction of errors. I think this is one of the most important areas for research – to bring the creative and competitive energies of the growing SMT research community to bear on targeted error correction in SMT. But so far this is not at all a focus for the research community.
SMT hasn’t developed strong, broadly applicable models for complex morphology. The languages tackled early in SMT – translating from French, Chinese and Arabic into English – allowed this problem to be deferred. People thought that Arabic would be a problem, but it turns out that it does not have the most complex morphology, and there is an abundance of data.
Morphology, which was a routine solved problem for rule-based MT developers, emerges as a theoretical and practical issue in SMT that has not been well addressed yet. In SMT, each word form is treated as a separate word (“going” and “went” are not linked to “go” in SMT). For morphologically complex languages (especially those with highly productive affixation resulting in very large numbers of word forms – Turkish, Japanese, Finnish), it is virtually impossible to get enough examples of each word form for an SMT system to learn from. At worst the problem of not-found-words of rule-based systems reappears, and at best the models based on low frequency items are simply not very strong.
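The sparsity problem described above can be illustrated with a toy sketch. It is not how any particular SMT system works: the corpus is invented, and the `LEMMAS` table is a hypothetical, hand-written stand-in for a real morphological analyzer. It only shows how evidence for one verb is split across its surface forms when each form is counted as a separate word.

```python
from collections import Counter

# Toy corpus (invented); a real SMT training set would hold millions of sentences.
corpus = "i go home . she went home . they are going home . he goes home".split()

# Hypothetical hand-written lemma table, standing in for a morphological analyzer.
LEMMAS = {"go": "go", "goes": "go", "going": "go", "went": "go"}


def surface_counts(tokens):
    """Count each word form as its own vocabulary item."""
    return Counter(tokens)


def lemma_counts(tokens):
    """Count tokens after mapping known word forms to a shared lemma."""
    return Counter(LEMMAS.get(token, token) for token in tokens)


surface = surface_counts(corpus)
lemma = lemma_counts(corpus)

# Each surface form of "go" is seen only once, while the lemma is seen four times.
for form in ("go", "goes", "going", "went"):
    print(form, surface[form])
print("lemma 'go':", lemma["go"])
```

With only four forms the fragmentation is mild. In a language with highly productive affixation the same verb can appear in far more forms, so each individual form is seen correspondingly less often.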
Also, SMT doesn’t yet have robust or high-performing models for language structure and grammatical relationships. Phrase-based models and local reordering often work fine in short sentences, but in long sentences (over 15-20 words), this breaks down. This is also true for rule-based MT as sentences get longer. The research community will object that many research groups have syntax-based models now. But these aren’t mature enough to change the market landscape just yet.
Then there is the inability to do targeted error correction: SMT learns what it learns from the data. While developers can review the statistically learned correspondence tables (parameters) and weed out obvious errors, a large SMT system will have so many millions of parameters that human review is not generally feasible. The answer for customers is to give the system more good data examples to learn from, or to go through and really “clean” the training data. But at the volumes of data that SMT likes to learn from, this is also a major undertaking.
You have highlighted some issues around training data. This is an area where TAUS is deeply involved through the TAUS Data Association. Can you explain to readers the importance of language data for training SMT and hybrid systems?
The so-called “blue ocean” opportunity for MT – to explosively grow the size of the translation market by making it possible to translate material that has not been translated before – will require relevant training data examples to learn from. And those quantities of data are needed for every language pair, direction and domain.
Where there are abundant sources of bilingual data, this has enabled development of high quality MT systems for languages and subject areas that are already frequently translated. However, in translation for information gathering, often users want to translate less commonly taught languages, and less standard text types (blogs, eBay offerings). Finding even a million words of translated data for these is currently almost impossible. Commercial users often want to introduce MT to translate material that they have not been able to translate before – for which they also do not have training data.
There’s the impression that there has been a disconnect between the needs of commerce and the research agenda. Why do you think that is?
While the pool of commercial MT developers has not grown hugely (several have entered the field, but several have left the field as well), the research community has exploded, with many new university computer science departments around the world establishing permanent faculty roles/specialties in MT research.
DARPA and the European Nth (various generations) framework programs have been hugely influential in shaping the agenda and goals of that research. I’m not up-to-date on the goals of the European funding programs, but in the US, the emphasis has been entirely on (not surprisingly) US military/intelligence interests. To turn this into a coherent, manageable programmatic goal, the focus has long been on translating published or broadcast news, and on automated “objective” performance metrics, primarily BLEU scores. However, this emphasis has distracted researchers entirely from things that the commercial market cares about, with the result that commercial demands for things like nice dictionary-building interfaces, customization, or postediting tools seem like an unreasonable diversion of talent from “the real/interesting problem” of getting the best BLEU score. This is beginning to change: DARPA has adopted new edit-distance metrics to balance automated evaluations. But BLEU is still the defining metric for most researchers because it is virtually free and instant. The informal self-evaluations going on all the time still focus on what is handy: the NIST test sets, which keep the focus on news.
Automatic evaluation (BLEU being just the most common and best known) has energized competition and achievement among researchers, but it has also caused an extreme focus on an evaluation metric that doesn’t always reflect commercial market priorities.
I think that the current flush of research interest and effort is unprecedented. But there have been smaller scale renaissances of MT research in the 1960s and 1980s. However, each time, there really wasn’t enough sustained research funding or enough jobs in the commercial market to keep all of the talent cultivated in universities in MT as a career. Perhaps the current explosion of university computer science departments and faculty positions will encourage more competition/diversification among researchers, some of whom may even venture into usability.
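For readers unfamiliar with these metrics, here is a minimal sketch of the two ideas mentioned: the clipped n-gram precision at the core of BLEU, and a word-level edit distance. It is a simplification under stated assumptions, not the official BLEU or any DARPA scoring tool. There is no brevity penalty, no averaging over n-gram orders, only a single reference, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams found in the reference, with each
    n-gram credited at most as often as it occurs in the reference."""
    hyp = Counter(ngrams(hypothesis, n))
    ref = Counter(ngrams(reference, n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def word_edit_distance(hypothesis, reference):
    """Levenshtein distance over words: the minimum number of insertions,
    deletions and substitutions that turn the hypothesis into the reference."""
    prev = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        curr = [i]
        for j, ref_word in enumerate(reference, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Invented example: the MT output drops one word from the reference.
reference = "the cat sat on the mat".split()
hypothesis = "the cat sat on mat".split()

print(clipped_precision(hypothesis, reference, 1))  # unigram precision
print(clipped_precision(hypothesis, reference, 2))  # bigram precision
print(word_edit_distance(hypothesis, reference))    # word edits needed
```

Both functions are cheap string comparisons against an existing reference translation, which is why such scores are virtually free and instant. Neither says anything about terminology, customization, or how usable the output is for a posteditor.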
Our thanks to Laurie Gerber for contributing to this TAUS Translation in the 21st Century article.
This article raises a number of issues around the use of language data to improve the quality of MT output. If you would like to learn more about how this can be achieved, please take a look at use cases from TAUS Data Association members. And if you want to find out more in person, join us at the upcoming TAUS User Conference 2010 in Portland (OR) from 3 to 6 October, where leading providers and users will showcase the latest advances.




