TAUS - Translation Automation

Thursday
Mar 11th

Putting language data sharing to work


Once an MT solution is in place, the vital strategic need for any translation user is access to good data, whether to train an SMT system, customize resources for an RBMT engine, or achieve advanced leveraging. This is the purpose of the recently launched TAUS Data Association (TDA) initiative, which pools and shares very large translation memory resources in an industry cloud of authoritative content. The TAUS User Conference featured instructive examples of how data sharing can drive better translation automation.



 

The more in-domain data the better

Microsoft’s Chris Wendt shared the results of a rich series of experiments examining how TDA-sourced data helps build better-trained SMT systems. The company’s syntactically informed SMT engine was trained on in-domain ‘IT and computing’ data from a progressively larger set of data sources, to evaluate how domain data from non-MS providers affects the output quality of a trained MS MT engine.

The MS engine includes a series of decoders – or target language models – that apply different weighting to generic data, domain data, or, most specifically, company data during training. So the experiment trained the machine with the following series of data:


Total leveraging using TDA shared data

In a third experiment on the benefits of sharing, KCSL’s Ilia Kaufman showed how more data can affect the quality and cost of both human and machine translation, using his NoBabel Enhancer ‘total leveraging’ technology.

Total leveraging means drawing on all relevant translation units in an entire translation memory to produce the translation of a single segment.
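To make the idea concrete, here is a minimal sketch of how every relevant unit in a memory might be collected for one segment. It is not the NoBabel Enhancer method, which the talk summary does not describe; the n-gram overlap score, the function names `ngrams` and `relevant_units`, and the `n_max` parameter are all assumptions made for illustration.

```python
def ngrams(tokens, n_max=3):
    """Return the set of all word n-grams (length 1..n_max) in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    }


def relevant_units(segment, memory, n_max=3):
    """Rank every TM unit that shares at least one n-gram with the segment.

    memory is a list of (source, target) string pairs. The score is an
    assumed measure: the fraction of the segment's n-grams that also occur
    in the unit's source text. Results are sorted best first.
    """
    wanted = ngrams(segment.lower().split(), n_max)
    ranked = []
    for source, target in memory:
        shared = wanted & ngrams(source.lower().split(), n_max)
        if shared:
            ranked.append((len(shared) / len(wanted), source, target))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked
```

Unlike a conventional fuzzy match, which returns the single closest unit, this kind of lookup returns every unit that covers some part of the segment, so a translator or an MT engine can assemble the translation from several partial matches.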


Other data sharing efforts

One of the most prominent data sharing efforts on the radar today, apart from TDA, is MyMemory from the privately owned Translated.net, which is mainly dedicated to providing TM resources rapidly for small and mid-market players. The repository has been successful in attracting data, offers a handy search interface for translators, and should reach the mark of 210 million usable segments over a wide range of languages by the end of this year.


Maximizing translation memories

Alex Yanishevsky of ProMT explained how standard TMs can be maximally squeezed to deliver greater efficiency and additional benefits in an MT environment, using a “content maximization suite”.

The ProMT system, combining a core rule base with a number of statistical processes, uses TMs as a TMX-file linguistic database to source a number of critical productivity tools.
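As an illustration of treating a TMX file as a database of segment pairs, the sketch below reads translation units with Python's standard library. It does not represent ProMT's implementation; the function name `read_tmx` and the exact-match handling of language codes are assumptions.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) segment pairs from a TMX document string.

    Language codes are compared case-insensitively but otherwise exactly,
    so "en" does not match "en-US" (a simplifying assumption).
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    pairs = []
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

The resulting list of pairs can then feed whatever tools sit on top of the memory, such as terminology extraction or engine training.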


The data normalization issue

If data sharing is gradually entering the mainstream, one key issue that all players need to address is translation memory quality. Part of the TDA agenda is to deliver a uniquely authoritative – or curated – source of language data, especially as anyone can scrape the web and build parallel corpora of extremely uneven, and therefore unproductive, translation data.

There are two key pain points in data quality – source data from the author, and parallel corpus data in TMs.


Source data improvements

Microsoft expects five simple style-guide rules to be applied to its source data to boost engine training and translation, among them: keep sentences short, use correct spelling and punctuation, and run the spelling and grammar checkers. Other, more complex authoring rules did not seem to affect output quality.

In a separate presentation, Andrew Bredenkamp from acrolinx argued that data for SMT must get better as soon as possible. He showed that multiple variants of an expression with a single meaning in a corpus can radically undermine SMT training. String-based normalization will eventually need to be supplemented by linguistic tagging to go deeper into source quality issues and help automatic correction systems learn how to fix the problems. He also argued that new forms of metadata will be needed to tag SMT training data, so that some historical information can be included in the mix to clarify where the data comes from.
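A minimal sketch of what string-based normalization can look like is given below. It does not show the acrolinx approach; the particular steps (Unicode NFKC, straightening curly quotes, collapsing whitespace, case folding) and the names `normalize` and `variant_groups` are assumptions chosen for illustration.

```python
import re
import unicodedata

# Map typographic quotes to their plain ASCII equivalents.
_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize(text):
    """Map superficial string variants of a segment to one canonical form."""
    text = unicodedata.normalize("NFKC", text)   # unify compatibility characters
    text = text.translate(_QUOTES)               # straighten curly quotes
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text.casefold()                       # ignore case differences


def variant_groups(segments):
    """Group segments by normalized form; keep groups with several variants."""
    groups = {}
    for seg in segments:
        groups.setdefault(normalize(seg), set()).add(seg)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}
```

Normalization of this kind only catches surface differences. Variants that differ in wording rather than in characters, such as two phrasings of the same instruction, are the cases where linguistic tagging would be needed.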

OTHER ARTICLES ON TAUS USER CONFERENCE 2009

- Let a thousand MT systems bloom
- Taking the MT decision: selection, build-out and hosting
- Connecting the parts: platforms, communities, standards
- Community building
- Localizing content for Customer Support
- Collective wisdom: Next steps for the industry
 
