Once an MT solution is in place, the vital strategic need for any translation user is to access good data to train an SMT system, customize resources for an RBMT engine, or achieve advanced leveraging. Hence the recently launched TAUS Data Association (TDA) initiative for pooling and sharing very large translation memory resources in an industry cloud of authoritative content. The TAUS User Conference featured instructive examples of how data sharing can drive better translation automation.
The more in-domain data the better
Microsoft’s Chris Wendt shared the results of a rich series of experiments examining how using TDA sourced data helps build better-trained SMT systems. The company’s syntactically-informed SMT engine was trained on in-domain ‘IT and computing’ data from a progressively larger set of data sources to evaluate the impact of using domain data from non-MS data providers on the output quality of a trained MS MT engine.
The MS engine includes a series of decoders, or target language models, that apply different weightings to generic, domain, or (most specifically) company data during training. The experiment therefore trained the machine on the following series of data:
- first with “general data” as used for the online www.microsofttranslator.com,
- second on Microsoft’s internal parallel data, derived from its product localization operations,
- third on Microsoft and Sybase data (Sybase data being in the IT domain but in insufficient quantities to train a complete system),
- lastly on General + Microsoft + TAUS data.
Each set was tested for BLEU scores, using a subset reserved from the data used in the experiment to measure quality.
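The evaluation step can be sketched in a few lines. The following is a minimal illustration, not the tooling used in the experiment: `hold_out` and `corpus_bleu` are hypothetical helper names, the BLEU variant is simplified (one reference per segment, whitespace tokenization, no smoothing), and the 10% test fraction is an assumption.

```python
import math
import random
from collections import Counter


def hold_out(pairs, fraction=0.1, seed=0):
    """Reserve a random subset of (source, target) pairs for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

In practice, the engine is trained on the first set returned by `hold_out`, the source side of the second set is translated, and the output is scored against the reserved target side.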
A key feature of this experiment was to give proper weighting to the different component data sets. As Chris Wendt put it, it is rather like asking a lot of questions to 3 different people – A, B and C. Over time, you find that A gives the right answer most of the time in subject areas, so you give her more importance than B and C, who may only be right some of the time.
So in training the system, each of the data sets is “asked” for an output, and you then vary the weight of the target language model and data until you get the best possible score for the available data mix. Giving the Sybase data model the highest weighting (i.e. the highest trust as data), combined with the General + Microsoft + TAUS training set, gave the best results (more than 8 BLEU points) when training an engine for a specific domain. In other words, using more data with the best language model delivers the best results.
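The weighting idea can be illustrated with a toy example. This is a sketch under stated assumptions rather than Microsoft's method: it uses smoothed unigram models instead of full language models, tunes the mixture weights by grid search, and scores each mix by average log-probability on held-out in-domain text rather than by BLEU. The function names and sample sentences are hypothetical.

```python
import itertools
import math
from collections import Counter


def unigram_model(text, vocab):
    """Add-one smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values()) + len(vocab)
    return {word: (counts[word] + 1) / total for word in vocab}


def held_out_score(models, weights, dev_tokens):
    """Average log-probability of the held-out tokens under the weighted mix."""
    log_prob = 0.0
    for word in dev_tokens:
        log_prob += math.log(sum(w * m[word] for w, m in zip(weights, models)))
    return log_prob / len(dev_tokens)


def best_weights(models, dev_tokens, steps=10):
    """Try every weight combination on a grid and keep the best-scoring one."""
    best = None
    for combo in itertools.product(range(steps + 1), repeat=len(models) - 1):
        if sum(combo) > steps:
            continue
        # The last weight takes whatever is left so the weights sum to 1.
        weights = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        score = held_out_score(models, weights, dev_tokens)
        if best is None or score > best[1]:
            best = (weights, score)
    return best


general = "the weather is nice today and the market is open"
domain = "click the button to install the database driver"
dev = "install the database driver".split()

vocab = set(general.split()) | set(domain.split()) | set(dev)
models = [unigram_model(general, vocab), unigram_model(domain, vocab)]
weights, score = best_weights(models, dev)
print(weights)  # the in-domain model should receive the larger weight
```

The source that "answers" the held-out questions best ends up with the most trust, which is the A, B and C analogy in miniature.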
One of the additional benefits of Microsoft’s experiments is that there is now a trained system built from all the data from IT companies in the TDA available for use.
See the full article on these findings.
In another experiment on the impact of data sharing, Intel is running a two-stage pilot project using data from the TDA along with its own translation memory resources. The aim is to measure such features as output quality, translation productivity and post-editing effort. Measurement of the last two is still in progress.
In the first pilot, the aim was to evaluate the contribution of TDA data to the translation of customer support content. Using advanced leveraging technology rather than MT, the data mix comprised Intel’s own TMs together with parallel data sets from the TDA. An analysis was made of all the repetitions, and matches from ‘100%’ to ‘no matches’ that could be found in the two data sources.
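A match analysis of this kind can be sketched as follows. The code is illustrative only: the band boundaries are assumptions, the similarity measure is Python's generic `difflib.SequenceMatcher` ratio rather than any commercial TM tool's fuzzy-match score, and the function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Assumed match bands, checked from the highest threshold down.
BANDS = [(1.0, "100%"), (0.85, "85-99%"), (0.70, "70-84%")]


def best_match(segment, tm_sources):
    """Highest similarity between a new segment and any TM source segment."""
    return max(
        (SequenceMatcher(None, segment.lower(), s.lower()).ratio() for s in tm_sources),
        default=0.0,
    )


def band(score):
    """Map a similarity score to a match band label."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "no match"


def match_profile(segments, tm_sources):
    """Count how many new segments fall into each match band."""
    return Counter(band(best_match(s, tm_sources)) for s in segments)
```

Running `match_profile` once against the company's own TM and once against the own-plus-shared TM, then comparing the two counters, shows whether the extra data moves segments into higher bands.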
The results were mixed. They showed that adding non-Intel data did not really change the segment-match profile of the documents when compared to Intel’s own data. This may be partly due to the specific nature of the Intel support content, and the fact that the low word-to-segment ratio meant that such segments could usually be found in the Intel translation memories.
However, Intel believes that drawing on TDA resources would be useful to seed translation memories for new projects where translation memories do not exist for a given company. In TDA, “Intel data is the cake and the rest is the icing”.
A second benefit is that a large TDA-fed data mix would contain the entire history of a company’s TMs, not just the immediate project. Translators can therefore search over a much larger TM space than is available for a single job. Using a unified platform such as TDA’s TAUS search, translators have greater visibility over the data universe, and quality assurance reviewers have access to a richer range of language usage. Intel’s Ryan Martin highlighted that TAUS search is also ideal for giving extra context in quick one-off translation jobs in which a single sentence may have been changed.
Total leveraging using TDA shared data
In a third data-sharing experiment, KCSL’s Ilia Kaufman showed how more data can impact the quality and cost of both human and machine translation, using his NoBabel Enhancer ‘total leveraging’ technology.
This means drawing on all relevant translation units in an entire translation memory to produce a translation of a single segment.
As a result of this sub-sentential processing, virtually all segments get a translation, and output quality can be scored in the same way as for an SMT system.
The data set used in the experiment comprised TM data in the software domain from five TDA members totaling 15 million words. Five scenarios were developed to test data sharing:
- Human translation using no TMs
- Owner-only ‘legacy’ TM1
- TM1 enhanced by NoBabel technology
- TM2 - with additional TMs from the four other companies
- TM2 enhanced by NoBabel
In the first and second scenarios, translators were timed when translating the new source text with and without a TM resource. In the other scenarios, NoBabel was tested with BLEU scores against online Google and Microsoft translation technology.
The results showed that in scenario 1, human translation took 122 minutes versus 76 minutes with TM1 and 73 minutes with TM2. A post-edited version of scenario 5 took just 61 minutes. In the MT comparison scenarios, Google and Microsoft were close in BLEU scores to the ‘raw’ human translation but performed less well than NoBabel on the automated tasks using broader data mixes.
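The relative time savings implied by these figures can be worked out directly. The short calculation below uses only the four timings reported above; the labels and the `saving` helper are illustrative.

```python
BASELINE_MINUTES = 122  # human translation with no TM

timings = {
    "TM1": 76,
    "TM2": 73,
    "post-edited scenario 5": 61,
}


def saving(baseline, minutes):
    """Percentage of time saved relative to the baseline."""
    return 100 * (baseline - minutes) / baseline


for label, minutes in timings.items():
    print(f"{label}: {saving(BASELINE_MINUTES, minutes):.1f}% less time than unaided translation")
```

This gives savings of roughly 37.7% for TM1, 40.2% for TM2 and 50.0% for the post-edited output of scenario 5.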
There were various obvious limitations to this small experiment, especially with regard to the human translator control component, and more tests should be run. But overall, leveraging of some sort (legacy or total) delivered higher quality translations than the ‘raw’ humans, and far greater cost savings and productivity. In turn, total leveraging is more effective than legacy leveraging only, and sharing data from multiple sources boosts speed and productivity.
See the full article on these findings.
The Spanish LSP Pangeanic also welcomed the availability of data sharing via the TDA, and is set to test various data combinations in its own domains. As one of the earliest translation vendors to embrace SMT, after an initial experience in post-editing RBMT output, Pangeanic rolled out its PangeaMT solution to existing clients in Q1 of 2009 on the basis of a Moses system, with technical support from a local university.
Senior Strategy Officer Manuel Herranz argued that data sharing is vital to this middle-market LSP in its bid to become a comprehensive solution provider. The rich data mix is allowing the company to build very large in-domain TMs for a reasonable cost, and expand the number of languages it can offer. Initial rollout of PangeaMT resulted in productivity gains ranging between 20 and 30% for FIGS. The addition of TDA data led to further productivity gains of 33-100%. PangeaMT was launched as a new commercial offering at the TAUS User Conference with the special offer of training one MT engine for free for buyers seriously looking to find out more.
Other data sharing efforts
One of the most prominent data sharing efforts apart from TDA on the radar today is privately-owned Translated.net’s MyMemory, mainly dedicated to providing TM resources rapidly for small and mid-market players. The repository has been successful in attracting data, offers a handy search interface for translators, and should reach the 210 million usable segments mark over a wide range of languages by the end of this year.
Maximizing translation memories
Alex Yanishevsky of ProMT explained how standard TMs can be maximally squeezed to deliver greater efficiency and additional benefits in an MT environment, using a “content maximization suite”.
The ProMT system, combining a core rule base with a number of statistical processes, uses TMs as a TMX-file linguistic database to source a number of critical productivity tools.
These include methods to shrink and expand segments, support for filters and file types, techniques to flag trustworthy segments, harvest multi-word terms for customer approval, and sub-sentential aligners to build text memories containing both an author memory and a translation memory, and also to build phrase tables for the translation component. There are four steps to maximization.
- Validating the TM to identify misalignments and mistranslations and using the resulting enhanced TM as an improved, trusted resource for further processes.
- Using this TM to automatically generate additional translation units by means of advanced leveraging on sub-sentential strings.
- Extracting customized, domain-specific dictionaries and keywords from the TM to optimize the MT search engine, using statistical sub-sentential alignment tools to deliver a frequency dictionary of lexical items.
- Creating a target language model from this same data set, by choosing the lowest level of perplexity – i.e. the lowest average number of word possibilities at each state in the grammar.
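For reference, the perplexity mentioned in the last step has a standard definition. For a test sequence of N words, a language model P has perplexity:

```latex
\mathrm{PP}(w_1 \dots w_N) = P(w_1 \dots w_N)^{-1/N}
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_1 \dots w_{i-1})\right)
```

A model with perplexity k is, on average, as uncertain as if it were choosing uniformly among k words at each position, so among candidate target language models the one with the lowest perplexity on in-domain text is preferred.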
In a test using a 5,000-word sample of Molina Healthcare data from TDA, the hybrid ProMT engine was benchmarked against Google Translate, resulting in similar BLEU scores. ProMT then trained its engine on the data with its maximization suite to achieve a significant 46% improvement. In productivity terms, post-editing 1,000 words on the baseline engine took 20 minutes, but on the trained engine took only 13 minutes, a 35% reduction in post-editing time. See full article on this study.
The data normalization issue
If data sharing is gradually entering the mainstream, one key issue that all players need to address is translation memory quality. Part of the TDA agenda is to deliver a uniquely authoritative, or curated, source of language data, especially as anyone can scrape the web and build parallel corpora of extremely uneven, and therefore unproductive, translation data.
There are two key pain points in data quality – source data from the author, and parallel corpus data in TMs.
The User Conference looked at both of these in a panel session on data cleaning strategies among data providers, moderated by Karen Combe of PTC. Typical issues on the TM side include excessive numbers of inline tags, irrelevant bits of data, mistranslations of homonyms, acronyms spelled out in target versions, one-into-two sentence mismatches, punctuation inconsistencies and upper/lower case mismatches, among others. All cause avoidable problems in the SMT training process. The question is: how can these be fixed or avoided, ideally with an automated solution?
Intel argued that there were instances that could be cleaned automatically (e.g. trademark codes, formatting, suspect characters, and converting escape sequences back into characters), and others that need to be thrown out, possibly from 2% up to 6% of all segments. The art is to find the sweet spot between adjusting and ditching.
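The split between "adjusting" and "ditching" might look like this in code. It is a minimal sketch and not Intel's pipeline: the trademark pattern, the control-character filter, the length-ratio threshold and all function names are assumptions chosen for illustration.

```python
import html
import re
import unicodedata

# Assumed trademark markers: "(tm)", "(r)" and the corresponding symbols.
TRADEMARKS = re.compile(r"\((?:tm|r)\)|[\u2122\u00ae]", re.IGNORECASE)


def clean(text):
    """Apply the automatic fixes to one side of a translation unit."""
    text = html.unescape(text)       # escape sequences back into characters
    text = TRADEMARKS.sub("", text)  # strip trademark codes
    # Drop control characters, keeping tabs and newlines for the next step.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) != "Cc" or ch in "\t\n"
    )
    return re.sub(r"\s+", " ", text).strip()


def keep(source, target, max_ratio=3.0):
    """Decide whether a cleaned unit is worth keeping for training."""
    if not source or not target:
        return False
    ratio = len(source) / len(target)
    return 1 / max_ratio <= ratio <= max_ratio


def clean_tm(units):
    """Clean every (source, target) pair and count the ones thrown out."""
    kept, dropped = [], 0
    for source, target in units:
        source, target = clean(source), clean(target)
        if keep(source, target):
            kept.append((source, target))
        else:
            dropped += 1
    return kept, dropped
```

Tracking `dropped` against the total makes it easy to see whether a given set of rules discards a plausible share of segments or far too many.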
As a contrast, ProMT, a hybrid engine, views such items as irregular characters and incomplete sentences/internal tags as useful data that help understand the text during run-time parses. Post editors also need to see these metadata. So “irrelevant” data are in fact left untouched as they are there for a reason. Everything else can be handled by the dictionary or by grammar rules, including tagged ‘Don’t Translates’.
In a world of very high volume data, Microsoft had no problem with being “pretty liberal about throwing away data at training time”. On the other hand, it called for a standard to handle “factoids” such as numbers and number data inside sentences. TDA could if necessary mask factoid data in TMs. Microsoft finds that data cleaning requires half a person-month to update TM resources, and also suggested that its cleaning tools should be shared inside TDA.
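Masking factoids could be as simple as replacing numeric tokens with a placeholder. The sketch below is hypothetical: no masking standard is described in the session, so the `<NUM>` placeholder, the number pattern and the function name are all assumptions.

```python
import re

# Digits with optional thousands or decimal separators, e.g. 12, 1,024.5, 12.05.2009
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def mask_factoids(text, placeholder="<NUM>"):
    """Replace numeric factoids with a placeholder and return the originals."""
    values = NUMBER.findall(text)
    return NUMBER.sub(placeholder, text), values


masked, values = mask_factoids("Install 2 GB of RAM before 12.05.2009")
print(masked)
print(values)
```

Keeping the original values alongside the masked text allows them to be restored after translation.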
Source data improvements
Microsoft expects 5 simple style-guide rules to be applied to its source data to boost engine training and translation: keep sentences short, correct spelling and punctuation, and run the spell and grammar checkers. Other more complex authoring rules did not seem to impact output quality.
In a separate presentation, Andrew Bredenkamp from acrolinx argued that the data must get better as soon as possible for SMT. He showed that multiple variants of a single meaning expression in a corpus can radically undermine SMT training. String-based normalization will eventually need to be supplemented by linguistic tagging to go deeper into source quality issues and help automatic correction systems to learn how to fix the problems. He also argued that new forms of metadata will be needed to tag SMT training data, so that some historical information can be included in the mix to clarify where the data comes from.