TDA members' pilot project proves the benefits of sharing translation memories

At a glance
- Sharing translation memories delivers significant weighted word count reductions in advanced leveraging tests undertaken by TDA member companies, with an average reduction of 43% across all scenarios before translator assessment
- Organizations with small or no translation memories in specific language pairs and/or domains are likely to gain most from using industry shared language data for advanced leveraging. The average weighted word count reduction for this scenario is 35% before translator assessment
- TDA members who are not currently using advanced leveraging are already actively looking to adopt this rapidly-developing technology
- TDA is reviewing data download rules to better cater for the needs of organizations with small or no translation memories
"It's great for TDA and its members that we are already fulfilling the promise of catalyzing innovation for the industry. Sharing translation memories on the industry-owned TDA platform proves to be very beneficial in combination with scalable translation technologies."
Jaap van der Meer, director TAUS Data Association
TDA members Jonckers, KCSL, Lionbridge, Lingotek, Milengo, MultiCorpora, and Welocalize ran independent tests during June and July to benchmark leverage using shared translation memories from TDA. Classic segment leveraging shows a marginal benefit, usually ranging between 3% and 5%. Significant gains are made when advanced, in-segment leveraging is applied to shared translation memories from TDA.
The two approaches
Advanced leveraging tests were undertaken by three companies. Each used the same data set of industry-shared corpora that was used for the benchmark study undertaken by the Centre for Translation Studies (CTS) at the University of Leeds, UK. That study showed a leverage of 80-100% matches on an average of 19% of segments in three source documents for companies with no translation memories of their own.
A repeat of the original benchmark study
To confirm the repeatability of the initial study and provide contrastive metrics, one of the three companies reran the CTS test. It reported a leverage of 75-100% matches on an average of 31% of segments in three source documents for companies with no translation memories (TMs).

Word count oriented test
Two companies used a weighted word count approach for their analyses and reporting. This approach provides a more commercially relevant and time-accurate indication of the possible translation efficiency gains than the original benchmarking approach.
As with the CTS study, two scenarios were run: (1) zero assets and (2) limited assets. The first quantifies potential benefits for companies with no translation memories of their own, and the second assesses potential gains for companies using their own translation memories in conjunction with the merged memories of other TDA members. The remainder of this report outlines the results of using the weighted word count approach.
The dataset
Merged translation memories from five TDA member companies in one language pair (EN-FR), same domain (computer software), and of a similar content type (software strings, documentation):
- A: 1,517,937 words
- B: 5,564,996 words
- C: 2,224,301 words
- D: 2,850,334 words
- E: 2,142,196 words
Word count weightings

How the weightings are applied
For example, a 1,000-word source document might be treated as equivalent to translating a 490-word document at the full rate per word.
Weighted word count = (0.1 x 350 (words 100% match)) + (0.3 x 250 (words 95-99% match)) + (0.6 x 50 (words 85-94% match)) + 150 + 200 = 490
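The worked example above can be reproduced with a minimal sketch. It uses only the three weights that appear in the formula (0.1, 0.3 and 0.6) and treats the remaining 150 + 200 words as charged at the full rate, which is what the total of 490 implies. The band labels and the function name weighted_word_count are illustrative assumptions, not part of the TDA test setup.

```python
# Weights as they appear in the worked example. The band labels are
# illustrative names chosen for this sketch, not TDA terminology.
WEIGHTS = {
    "100% match": 0.1,
    "95-99% match": 0.3,
    "85-94% match": 0.6,
}


def weighted_word_count(banded_words, full_rate_words):
    """Return the weighted word count for one document.

    banded_words maps a match band to the number of source words in it;
    full_rate_words is the number of words charged at the full rate.
    """
    discounted = sum(WEIGHTS[band] * words for band, words in banded_words.items())
    return discounted + full_rate_words


# Figures from the 1,000-word example in the text.
example_bands = {"100% match": 350, "95-99% match": 250, "85-94% match": 50}
example_full_rate = 150 + 200

weighted = weighted_word_count(example_bands, example_full_rate)
total_words = sum(example_bands.values()) + example_full_rate
reduction = 1 - weighted / total_words

print(f"{round(weighted)} weighted words out of {total_words} ({reduction:.0%} reduction)")
```

For the example document this gives 490 weighted words out of 1,000, that is, a 51% word count reduction before translator assessment.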
Results of word count tests
Advanced leveraging of shared translation memories led to an average 43% word count reduction across all scenarios and documents before translator assessment.
The table below shows average reduced (weighted) document word counts expressed as absolutes and percentages for the two TDA members that undertook the tests.
Shared TMs: Advanced leveraging (AL) with zero language assets
Average 35% word count reduction across three documents for the two TDA members that undertook these tests, before translator assessment.

Average 49% word count reduction across four documents for the two TDA members that undertook these tests, before translator assessment.
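The report does not state how the overall 43% figure was derived from the two scenario averages. One reading that is consistent with the published numbers is an average weighted by the number of documents in each scenario. The sketch below checks that reading; the weighting method, the function name overall_average and the "limited assets" label for the 49% result are assumptions.

```python
# (average reduction in percent, number of documents) per scenario, as reported.
# The "limited assets" label for the 49% figure is an assumption of this sketch.
scenarios = {
    "zero assets": (35, 3),
    "limited assets": (49, 4),
}


def overall_average(results):
    """Average the scenario reductions, weighting each by its document count."""
    total_docs = sum(docs for _, docs in results.values())
    weighted_sum = sum(pct * docs for pct, docs in results.values())
    return weighted_sum / total_docs


print(f"Overall average reduction: {overall_average(scenarios):.0f}%")
```

Under this assumption, (35 x 3 + 49 x 4) / 7 = 43, which matches the average reported across all scenarios.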

Considerations
- The leveraging rates/word count reductions do not necessarily 'translate' into productivity gains. The extent to which the increased leveraging/word count reductions shown in these tests lead to higher translation productivity depends on the usability of the leveraged translations as judged by translators
- These tests were performed on a fairly broadly defined corpus of translations coming from the computer software industry. Leveraging rates are likely to improve with data from a single industry category and sub-domains
- The industry in general is only now opening up to the opportunities of large and shared collections of language data. Technologies and solutions will adapt and get better at working with these data
- The size of the shared data set in this test was limited to around 15 million words. Further experiments will be needed to explore whether the "more data is better data" principle that is widely evoked in statistical machine translation is valid for leveraging translation memories
Future reports
- Early September - results of machine translation training tests using industry shared TDA data
- Mid-September - detailed report for TDA members on lessons learnt during tests
- End-September - Second set of results for machine translation training as well as advanced leveraging test with post-editing
- Mid-October - Advanced leveraging report for TAUS and TDA members
- 28-30 October - TAUS User Conference on the "Profit of Sharing"


