TAUS - Enabling better translation

Thursday
Sep 02nd

Increasing leveraging from shared industry data

A benchmark from University of Leeds

Bogdan Babych and Tony Hartley

The TAUS Data Association (TDA) has undertaken an assessment of how sharing larger sets of translation memories from different company sources affects the leveraging of translated data. The assessment was carried out by the Centre for Translation Studies of the University of Leeds.

This pilot project shows that when companies that as yet have no translated data of their own compare their new source text with translation memories from other companies within their industry, they can leverage matches of 80%-100% for up to 16% of the segments in the document. Moreover, a company with its own translated data may find that other memories from the same industry supply, on average, more than half as many exact matches as a memory of its own would.

This pilot project does not provide a practical measurement of translation productivity, but it gives the TAUS Data Association and all of its members an important benchmark for the potential advantages of the industry-shared language data platform.

"We put a stick in the ground with this benchmark," says Jaap van der Meer, director of TDA. "In the coming months TDA members will be able to run tests and undertake pilot projects with the repository of shared translation memories. The TDA platform already contains more than half a billion words in 70 languages. The industry will have great opportunities to improve on this benchmark when it starts tuning its services and technologies towards the new reality of a super cloud of industry-shared language data."

"Shared data sets and common standards for reporting results are a key feature of the task-based ‘competitions’ that have done so much to drive progress in other areas of language processing, such as information extraction and machine translation," add Bogdan Babych and Tony Hartley of Leeds Centre for Translation Studies. "The TDA repository and this pilot could be a first step towards an agreed set of metrics for advanced leveraging."

The report below gives an overview of the tests, scores and considerations.

The tests

The tests were performed with the following materials:

Merged translation memories from five TDA member companies in one language pair (EN-FR), same domain (computer software), similar content type (software strings, documentation):

  • A: 1,517,937 words
  • B: 5,564,996 words
  • C: 2,224,301 words
  • D: 2,850,334 words
  • E: 2,142,196 words

New source documents from three different TDA member companies, English source language, same domain, similar content type:

  • X: 6,999 words
  • Y: 3,669 words
  • Z: 3,718 words

Source documents from four of the five companies supplying the translation memories listed above, English source language, same domain, similar content type:

  • A: 7,759 words
  • B: 28,232 words
  • C: 9,222 words
  • D: 4,508 words

The translation editor used for this pilot is a commercial system that matches both full segments (‘standard leveraging’) and in-segment phrases (‘advanced leveraging’).

The scores

CTS Leeds conducted two tests. The ‘zero assets scenario’ shows the scores for companies that have no translated data to contribute and simply leverage against ‘industry TMs’. These scores are the percentage of segments in the source document for which a high-value match is found in the merged translation memory.

Zero assets scenario
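
The zero assets score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pilot used a commercial translation editor whose matching algorithm is not described in this report, so the standard-library `difflib.SequenceMatcher` ratio stands in for its fuzzy-match score. The 0.8 threshold mirrors the 80%-100% match band mentioned earlier, and the function names and sample segments are hypothetical.

```python
from difflib import SequenceMatcher


def best_match_score(segment, memory):
    """Highest similarity (0.0 to 1.0) between a segment and any memory segment.

    Assumption: SequenceMatcher.ratio() is a stand-in for the fuzzy-match
    score of the commercial tool used in the pilot.
    """
    return max(
        (SequenceMatcher(None, segment, tm).ratio() for tm in memory),
        default=0.0,
    )


def leverage_rate(document, memory, threshold=0.8):
    """Percentage of document segments whose best match reaches the threshold."""
    if not document:
        return 0.0
    hits = sum(1 for seg in document if best_match_score(seg, memory) >= threshold)
    return 100.0 * hits / len(document)


# Hypothetical merged 'industry TM' (source side only) and new source document.
merged_memory = [
    "Click the Save button.",
    "Select a file to open.",
    "The operation completed successfully.",
]
new_document = [
    "Click the Save button.",                      # exact match
    "Select a file to open",                       # near match (missing full stop)
    "Restart the computer to apply the changes.",  # no useful match
    "Unplug the device.",                          # no useful match
]

print(leverage_rate(new_document, merged_memory))
```

With this sample data, two of the four segments reach the threshold, so the rate is 50.0. Real figures depend heavily on the matching tool and on how segments are split.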

The ‘limited assets scenario’ is intended to simulate a situation where a company's own memory gives poor coverage, motivating it to leverage against other available translation memories as well. To this end, for each company in turn, we deleted from the merged translation memory any exact matches found in the company's own memory, in order to artificially cap the assets and establish what proportion of exact matches may come from sharing other companies' data. On average, other companies' memories provide more than half of the exact matches a company can expect from its own memory.

Limited assets scenario
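
One possible reading of the limited assets comparison is sketched below in Python. It is an interpretation, not the procedure CTS Leeds actually ran: it assumes that exact matching means string equality on source segments, that the company's own segments are removed from the merged memory before counting, and that the result is reported as the ratio of exact matches from other companies' data to exact matches from the company's own memory. All names and sample segments are hypothetical.

```python
def exact_match_count(document, memory):
    """Number of document segments that appear verbatim in the memory."""
    known = set(memory)
    return sum(1 for seg in document if seg in known)


def shared_to_own_ratio(document, own_memory, merged_memory):
    """Exact matches from other companies' data relative to the company's own.

    Assumption: segments already present in the company's own memory are
    removed from the merged memory, so the two counts never overlap.
    Returns None when the own memory yields no exact matches.
    """
    own = set(own_memory)
    others_only = [seg for seg in merged_memory if seg not in own]
    own_hits = exact_match_count(document, own_memory)
    shared_hits = exact_match_count(document, others_only)
    return shared_hits / own_hits if own_hits else None


# Hypothetical data: the company's own memory, and a merged memory that
# also contains segments contributed by other companies.
own_memory = ["Click OK.", "Enter your password.", "Close the window."]
merged_memory = own_memory + ["Select a printer.", "Insert the disc."]
document = [
    "Click OK.",             # exact match in own memory
    "Enter your password.",  # exact match in own memory
    "Select a printer.",     # exact match only in other companies' data
    "Update the driver.",    # no exact match anywhere
]

print(shared_to_own_ratio(document, own_memory, merged_memory))
```

Here the own memory supplies two exact matches and the other companies' data supplies one, giving a ratio of 0.5, which corresponds to "half as many exact matches" in the wording used above.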

The considerations

The following considerations need to be taken into account with the results of this test:

  • The leveraging rates do not necessarily ‘translate’ into productivity rates. The extent to which the increased leveraging shown in these tests leads to higher translation productivity depends on the usability of the leveraged translations, as judged by the translator.
  • Different tools give different results for both standard and, especially, advanced leveraging.
  • This test was performed on a fairly broadly defined corpus of translations coming from the computer software industry. Leveraging rates are likely to improve considerably with data from a single industry category and sub-domains, such as CAD-CAM software.
  • The industry in general is only now opening up to the opportunities of large and shared collections of language data. Technologies and solutions will adapt and get better at working with these data.
  • The size of the shared data set in this test was limited to around 15 million words. Further experiments will be needed to explore whether the "more data is better data" principle that is widely invoked in statistical MT is valid for leveraging translation memories.

For more information, please see the TAUS Advanced Leveraging report.
 
