TAUS - Enabling better translation

Thursday
Sep 02nd

Putting language data sharing to work


Once an MT solution is in place, the vital strategic need for any translation user is access to good data to train an SMT system, customize resources for an RBMT engine, or achieve advanced leveraging. Hence the recently launched TAUS Data Association (TDA) initiative, which pools and shares very large translation memory resources in an industry cloud of authoritative content. The TAUS User Conference featured instructive examples of how data sharing can drive better translation automation.



 

The more in-domain data the better

Microsoft’s Chris Wendt shared the results of a rich series of experiments examining how TDA-sourced data helps build better-trained SMT systems. The company’s syntactically informed SMT engine was trained on in-domain ‘IT and computing’ data from a progressively larger set of data sources, to evaluate how domain data from non-MS providers affects the output quality of a trained MS MT engine.

The MS engine includes a series of decoders, or target language models, that apply different weighting to generic, domain, or, most specifically, company data during training. The experiment therefore trained the engine on data from a progressively larger series of sources.
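As a rough illustration of the weighting idea described above, the sketch below linearly interpolates the probability that three separately trained models assign to the same candidate. Linear interpolation is a common way to combine language models, but it is an assumption here: the presentation did not say how the MS engine combines its models. The function name, weights and probabilities are all made up for the example.

```python
def interpolate(probabilities, weights):
    """Combine per-source probabilities for one candidate by linear interpolation.

    Assumption: each data source (generic, domain, company) has its own model,
    and the final score is a weighted sum of their probabilities.
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # A source with no probability for this candidate contributes zero.
    return sum(weights[name] * probabilities.get(name, 0.0) for name in weights)


# Hypothetical probabilities that each model assigns to the same target phrase.
phrase_probs = {"generic": 0.02, "domain": 0.10, "company": 0.30}

# Hypothetical weights that favour the most specific data.
weights = {"generic": 0.2, "domain": 0.3, "company": 0.5}

score = interpolate(phrase_probs, weights)
print(round(score, 3))
```

Raising the company weight pushes the score toward what the company-specific model prefers, which is the intuition behind weighting the most specific data most heavily.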


Total leveraging using TDA shared data

In a third experiment on the benefits of sharing, KCSL’s Ilia Kaufman showed how more data can affect the quality and cost of both human and machine translation, using his NoBabel Enhancer ‘total leveraging’ technology.

This means drawing on all relevant translation units in an entire translation memory to produce a translation of a single segment.
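To make the contrast with single-best fuzzy matching concrete, here is a minimal sketch that returns every translation unit sharing enough words with the segment to be translated, rather than only the closest one. It is not the NoBabel Enhancer algorithm, which the presentation did not detail; the word-overlap score, the threshold and the sample units are assumptions chosen for illustration.

```python
def word_overlap(segment, source):
    """Fraction of the segment's distinct words that also occur in a TM source."""
    seg_words = set(segment.lower().split())
    src_words = set(source.lower().split())
    return len(seg_words & src_words) / len(seg_words) if seg_words else 0.0


def relevant_units(segment, tm, threshold=0.3):
    """Return all TM units at or above the threshold, best first."""
    scored = [(word_overlap(segment, src), src, tgt) for src, tgt in tm]
    hits = [unit for unit in scored if unit[0] >= threshold]
    return sorted(hits, key=lambda unit: unit[0], reverse=True)


# Hypothetical translation memory of (source, target) pairs.
tm = [
    ("Click the Save button", "Cliquez sur le bouton Enregistrer"),
    ("Open the File menu", "Ouvrez le menu Fichier"),
    ("Restart the computer", "Redémarrez l'ordinateur"),
]

hits = relevant_units("Click the File menu", tm)
for score, src, tgt in hits:
    print(f"{score:.2f}  {src}  ->  {tgt}")
```

Neither retrieved unit matches the whole segment, yet together they cover most of its words, which is the kind of partial evidence a leveraging approach can combine.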


Other data sharing efforts

Apart from TDA, one of the most prominent data sharing efforts on the radar today is the privately owned Translated.net’s MyMemory, mainly dedicated to providing TM resources rapidly for small and mid-market players. The repository has been successful in attracting data, offers a handy search interface for translators, and should reach the 210 million usable segments mark over a wide range of languages by the end of this year.


Maximizing translation memories

Alex Yanishevsky of ProMT explained how standard TMs can be maximally squeezed to deliver greater efficiency and additional benefits in an MT environment, using a “content maximization suite”.

The ProMT system, combining a core rule base with a number of statistical processes, uses TMs as a TMX-file linguistic database to source a number of critical productivity tools.
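As an illustration of treating a TMX file as a small linguistic database, the sketch below reads source and target segments out of a TMX document with Python's standard library. It says nothing about how ProMT itself processes TMX; the sample document, the function name and the language codes are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


# A made-up, minimal TMX document used only for this example.
SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx(SAMPLE, "en", "fr")
print(pairs)
```

Once the units are in a plain list of pairs, they can feed whatever downstream tool needs them, such as a terminology extractor or an MT training set.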


The data normalization issue

If data sharing is gradually entering the mainstream, one key issue that all players need to address is translation memory quality. Part of the TDA agenda is to deliver a uniquely authoritative, or curated, source of language data, especially as anyone can scrape the web and build parallel corpora of extremely uneven, and therefore unproductive, translation data.

There are two key pain points in data quality: source data from the author, and parallel corpus data in TMs.


Source data improvements

Microsoft expects 5 simple style-guide rules to be applied to its source data to boost engine training and translation: keep sentences short, correct spelling and punctuation, and run the spell and grammar checkers. Other more complex authoring rules did not seem to impact output quality.

In a separate presentation, Andrew Bredenkamp from acrolinx argued that the data must get better as soon as possible for SMT. He showed that multiple variants of an expression with a single meaning in a corpus can radically undermine SMT training. String-based normalization will eventually need to be supplemented by linguistic tagging to go deeper into source quality issues and help automatic correction systems learn how to fix the problems. He also argued that new forms of metadata will be needed to tag SMT training data, so that some historical information can be included in the mix to clarify where the data comes from.
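A minimal sketch of what string-based normalization might look like in practice: it groups segments that differ only in case, spacing, quote style or final punctuation, so that such variants are counted as one expression. The specific rules and the sample segments are assumptions for illustration, not the ones presented at the conference, and they deliberately stop short of the linguistic tagging mentioned above.

```python
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Apply simple, purely string-based normalization to one segment."""
    text = unicodedata.normalize("NFKC", text)                    # unify compatibility forms
    text = text.replace("\u2018", "'").replace("\u2019", "'")     # straighten curly quotes
    text = re.sub(r"\s+", " ", text).strip()                      # collapse whitespace
    return text.casefold().rstrip(".!")                           # ignore case and final . or !


def group_variants(segments):
    """Map each normalized form to its surface variants, keeping only real groups."""
    groups = defaultdict(list)
    for segment in segments:
        groups[normalize(segment)].append(segment)
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


# Hypothetical source segments from a training corpus.
corpus = ["Click  OK.", "click OK", "Click OK!", "Press Enter"]

variants = group_variants(corpus)
print(variants)
```

Rules like these catch surface noise only. Two differently worded sentences with the same meaning would still be treated as distinct, which is where deeper linguistic analysis would be needed.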

OTHER ARTICLES ON TAUS USER CONFERENCE 2009

- Let a thousand MT systems bloom
- Taking the MT decision: selection, build-out and hosting
- Connecting the parts: platforms, communities, standards
- Community building
- Localizing content for Customer Support
- Collective wisdom: Next steps for the industry
 
