Training machine translation engines is a big topic lately. Everyone wants MT but the quality is generally not good enough for business use. So training and customization are crucial to the success of MT. In this article we share our perspective on the trends concerning the complexity of the process and the cleanness of the data.
We invite readers to support and join the growing community of new-generation MT developers. To let a thousand MT systems bloom. To help the world communicate better.
This is a prelude to the TAUS Executive Forum in Copenhagen (May 19-21) where we will spotlight the latest developments in the field. Developments that are rapidly moving the industry closer to ‘push button’ training and customization.
Complex?
For a long time developing a new MT engine was a multi-million dollar project. Over the past thirty years we have seen MT companies and projects come and go because of the complexity and cost of developing a reasonably well-performing MT engine. The complexity of customizing and building MT engines lay in the laborious coding of dictionaries and the adaptation of grammatical rules.
Developing an MT engine for an entirely new language pair would take a year or two. Logos – one of the early commercial MT companies – found new investors repeatedly to fulfill its dream of automatic translation before finally going under, and surfacing again in an after-life open-source incarnation. (See our article on open-source MT systems.)
The large-scale Eurotra project, which aimed at automatically translating European languages in the eighties, is generally remembered as overly ambitious and a failure. Systran is one of the few first-generation MT companies that survived under its own brand.
The long history of failures and lost fortunes weighs heavily on today’s MT market. We don’t want to burn our fingers again. And yet, the training of new MT engines has become a much simpler process today. In fact, we are well on our way to joining the trend seen in many other areas: towards customer self-service.
In the old days adding dictionary entries to MT engines required linguistic skills. The ‘customizer’ was asked to provide codes for syntax, morphology and often a semantic class as well. For a number of years now, most rule-based engines have offered automatic coding routines that allow less specialized users to simply upload plain glossaries in Excel or other formats. But the real innovation in customization is of a more recent date. The keyword is ‘statistical’.
You teach a computer how to translate not by trying to teach it rules and dictionaries, but simply by feeding it with translations as well as texts in the target language. The computer will start recognizing patterns in the sequences and forms of words. Using these patterns the computer can generate new translations. This statistical approach to the machine translation problem has opened new possibilities.
There are pure statistical engines, but interestingly rule-based engines have adopted statistical routines as well. Systran for instance combines the two approaches in its new release 7 and promises to add a “training wizard” in the upcoming release 7.2, allowing the user to run an additional round of customization based on domain-related documents automatically harvested from the user’s own computer. Talk about customer self-service... (See the TAUS Technology Briefing on Systran.)
Training a new MT engine now takes days or hours. Soon it will be a fully automatic process and, in fact, a user-activated feature. This ease of training and customizing MT will catalyze a shift from using general, not-so-good engines towards adopting a domain-specific or product-specific engine for every job. A revolution indeed!
Clean?
It is clear where the discussion about ‘clean’ and ‘dirty’ data started. MT engines learn from what we feed them. If we feed them bad translations, they will produce bad translations. The problem is that we do not have a clear definition of ‘clean’ or ‘dirty’ in this respect. One thing is clear: MT engines don’t like formatting tags. They learn from plain text only.
The cleaning process for MT training therefore always starts with stripping out all formatting information, such as HTML and XML tags. Translation memories are an excellent training resource for MT engines, except that they are usually stored in the TMX standard, with formatting codes before and after each segment and even scattered within segments. Understandably, MT developers call this ‘dirty’.
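As an illustration of this first cleaning step, here is a minimal Python sketch that reduces a single TMX segment to plain text. It assumes the segment is available as a well-formed `<seg>` string and that the inline elements `bpt`, `ept`, `ph`, `it` and `ut` carry only formatting codes; the function names are hypothetical, and a production cleaner would also need to handle cases such as translatable `<sub>` content nested inside those codes.

```python
import xml.etree.ElementTree as ET

# TMX inline elements whose content is native formatting code, not text to learn from.
CODE_ELEMENTS = {"bpt", "ept", "ph", "it", "ut"}


def plain_text(element):
    """Collect the text of an element, skipping the content of formatting codes."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in CODE_ELEMENTS:
            # Elements such as <hi> wrap real text, so keep what is inside them.
            parts.append(plain_text(child))
        # Text that follows a child element always belongs to the segment.
        parts.append(child.tail or "")
    return "".join(parts)


def clean_segment(seg_xml):
    """Return the plain text of one <seg> element with whitespace normalized."""
    text = plain_text(ET.fromstring(seg_xml))
    return " ".join(text.split())


seg = ('<seg>Click <bpt i="1">&lt;b&gt;</bpt>Save<ept i="1">&lt;/b&gt;</ept>'
       ' to <hi>continue</hi>.</seg>')
print(clean_segment(seg))  # Click Save to continue.
```

Parsing the segment as XML rather than deleting everything between angle brackets keeps text that merely sits inside a highlighting element, which a blunt pattern match could lose.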
Beyond that the definition of ‘clean’ and ‘dirty’ starts to blur. It is confused by the debate about how much data is enough to train a new engine. The experts fight each other with statements like “the more data the better” or “more than 10% dirty data has a negative impact on SMT systems”.
Google is reportedly using as much data as it can get its hands on to train MT engines, citing the “unreasonable effectiveness of data” and arguing that spelling and grammar errors in a training corpus do not necessarily do any harm when you work with billions rather than millions of words.
ProMT and Systran are at the other end of the spectrum, relying on the core value of rules and preferring to use a compact, clean and specialized text corpus to finish the customization job.
‘Clean’ – beyond the issue of formatting tags – also means good translations, no different from the quality criteria in the world of human translation. No spelling errors, no grammatical mistakes, no missing translations in translation memory files, no punctuation mistakes. And if we can have more, let us have consistent use of terminology and stylistically good translations. All common sense and good practice.
All common sense and good practice, but still not enough. In the world of MT developers, the word ‘clean’ also means ‘fit for the job’. As we shift from generic MT engines to domain-specific trained-on-the-job engines, clean data also refers to topic and style.
Ask trainers of MT engines where they can make a difference, what makes one engine better than another. They will tell you: sure, there are parameter settings we can play with. But frankly, the biggest difference is found in the data selection.
Finding translations and text that match the domain of the new job at hand, both in topical reference and in style: that is the art in the new generation of MT engines. Call it ‘clean’ if you like. It is really more of a secret recipe that we will all need to unravel to go forward.
Come on!
The new-generation MT systems provide a tremendous opportunity for the translation industry. Since we published our article about open-source MT engines a month ago we have learned about two new open-source developments: improvements to THOT, an SMT engine from Valencia Polytechnic University, and Marclator, a new EBMT system released by the MT research group at Dublin City University.
At TAUS we are tracking the exciting experiences of companies pioneering in this new space of training MT engines. They are often small companies: language service providers, spin-offs from universities or new start-ups, such as Pangeanic, Languagelens, Tilde, Digital Silk Road, Safaba, Morphologic, Celer Soluciones, Metatrad, Cross Language.
We expect that dozens of new companies will join this space and dedicate themselves to training and retraining niche MT engines. A thousand MT systems will bloom. As we learn more, we will uncover the recipes for ‘cleaning data’. These recipes will be different for different locales, different domains and different styles. It is these important nuances that will help make this emerging sector vibrant and dynamic.
TAUS and MT Training
The TAUS Data Association (TDA) is an enabler in this changing market environment. TDA is a non-profit industry-owned member organization aimed at hosting the world’s translation memories. The TDA repository (February 2010) contains close to 2 billion words of translations in 200 language pairs. Our goal is to reach 10 billion words before the end of the year.
Clean or dirty? TDA is focused on supporting the requirements for ‘clean’ data in many different ways. Firstly, we host translations from trusted sources only. You can always check who the owner of the data is and who has provided the data.
We run automatic routines to filter out empty segments, i.e. missing translations. For more informal quality assurance TDA is banking on peer review. The portal provides a five-star rating and comment fields for users to assess the quality of translations.
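The empty-segment filter is the simplest of these checks. A minimal sketch in Python, assuming the translation units have already been reduced to (source, target) text pairs; the function name is hypothetical and this is not TDA's actual routine:

```python
def drop_empty_pairs(pairs):
    """Keep only (source, target) pairs where both sides contain visible text."""
    return [(src, tgt) for src, tgt in pairs if src.strip() and tgt.strip()]


pairs = [
    ("Save the file.", "Enregistrez le fichier."),
    ("Cancel", ""),        # missing translation
    ("   ", "Annuler"),    # source is whitespace only
]
kept = drop_empty_pairs(pairs)  # only the first pair survives
```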
This year we will be adding a statistical routine to automatically check and filter out bad translations by computing an alignment score for each segment based on the sum of the alignment probabilities of source and target words. Taken together, these measures will help to meet basic needs.
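To give a feel for what such an alignment score could look like, here is a toy Python sketch. The details are assumptions, not a description of the TDA routine: it supposes a lexical probability table keyed by (source word, target word) has already been estimated from parallel data, takes the best-matching source word for each target word, sums those probabilities and divides by the target length. The table values and the 0.3 threshold are made up for illustration.

```python
def alignment_score(source, target, lex_prob):
    """Average, over target words, of the best lexical probability with any source word.

    lex_prob maps (source_word, target_word) to a probability; unseen pairs count as 0.
    """
    src_words = source.lower().split()
    tgt_words = target.lower().split()
    if not src_words or not tgt_words:
        return 0.0
    total = 0.0
    for tgt in tgt_words:
        total += max(lex_prob.get((src, tgt), 0.0) for src in src_words)
    return total / len(tgt_words)


def filter_by_score(pairs, lex_prob, threshold=0.3):
    """Keep the pairs whose score reaches the (hypothetical) threshold."""
    return [p for p in pairs if alignment_score(p[0], p[1], lex_prob) >= threshold]


# Toy probability table, invented for this example.
lex = {("the", "la"): 0.7, ("house", "maison"): 0.8}
good = ("the house", "la maison")
bad = ("the house", "le chat")
kept = filter_by_score([good, bad], lex)  # keeps only the well-aligned pair
```

Dividing by the segment length keeps long and short segments comparable; a raw sum would favor long segments regardless of quality.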
But as we have seen, the crux is really in finding the data that match the domain and job. To help users zoom in and out and find the right data, TDA stores every translated segment with attributes for source and target locale (of course), data owner or source, product line (optional), data provider, date of creation, date of provision, industry sector, domain (optional) and content type or genre.
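The attribute list above maps naturally onto a small record type. The following Python sketch shows one possible representation and a simple exact-match selection over it; the class, field and function names are hypothetical and do not reflect the actual TDA schema or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Segment:
    """One translated segment with the metadata attributes listed above."""
    source_text: str
    target_text: str
    source_locale: str
    target_locale: str
    owner: str
    provider: str
    created: date
    provided: date
    industry: str
    content_type: str
    product_line: Optional[str] = None  # optional attribute
    domain: Optional[str] = None        # optional attribute


def select(segments, **criteria):
    """Return the segments whose attributes equal every given criterion."""
    return [
        seg for seg in segments
        if all(getattr(seg, name) == value for name, value in criteria.items())
    ]


corpus = [
    Segment("Open the ledger.", "Ouvrez le grand livre.", "en-US", "fr-FR",
            "ExampleCorp", "ExampleCorp", date(2009, 5, 1), date(2010, 1, 15),
            "financial", "software strings", domain="accounting"),
    Segment("Restart the printer.", "Redémarrez l'imprimante.", "en-US", "fr-FR",
            "ExampleCorp", "ExampleCorp", date(2008, 3, 9), date(2010, 1, 15),
            "computer hardware", "user manual"),
]
training_set = select(corpus, target_locale="fr-FR", industry="financial")
```

Exact matching on a single sector is deliberately naive: it would miss a financial software memory filed under ‘computer software’, which is precisely why selection by metadata alone has its limits.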
We have put a lot of effort into providing a clear taxonomy for industry sectors and domains, but we also recognize that it is very difficult for users to always be confident of finding the best dataset for their needs. For instance, a translation memory for a financial software application may be stored in the ‘financial’ domain or in the ‘computer software’ domain. To address this issue TDA will soon add new features to select the optimal dataset for MT training.
Come On! Join this exciting new movement in training MT systems. Be part of this tremendous growth opportunity.
Suggested Links
TAUS Executive Forum, 19-21 May, Copenhagen
Guidance on data normalization to prepare for MT training
TAUS Data Association - www.tausdata.org




