TAUS - Translation Automation

Tuesday
Nov 18th

Languagelens: from PhD project to dedicated patent MT service


The Danish Languagelens system is a statistical MT engine that began as an academic project two years ago and now drives millions of words of English-to-Danish patent translation at the Copenhagen-based LSP Lingtech. Theoretical linguist Daniel Hardt now supervises development at Languagelens.

"The bottom line to the rise of SMT is Moore's Law. As long as computing power grows exponentially, anything linked to it - such as vast amounts of data - will expand in the same way," says Hardt. Languagelens grew out of a research project on integrating SMT with linguistic knowledge, and has been operating commercially since 2007.

 

Under its current business model, Languagelens licenses the technology to Lingtech, which has a strong track record in rule-based MT, and to CorpusMT, a new language technology company, and is gradually building up a client base in partnership with them.

"Our competitive edge in this field will be in customizing the engine to the special needs of clients who already have 10 million plus words of parallel language data. As soon as you enter a subject matter domain, the data will explode as post-edited output is recycled as automatic training content. Data and quality will always improve over time."

For Daniel Hardt, the key to success with SMT is first to get above the quality threshold below which post-editing takes longer than translating from scratch. The inherent quality improvement cycle, based on retraining with post-edited content, will then continually reduce the post-editing effort. He notes that this is not something that the Google MT service can deliver.
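The break-even test described here can be written down in a few lines. The sketch below is illustrative only and is not part of the Languagelens system: the function names and the timing figures (a 15-minute human baseline and a 9-minute post-edit) are assumptions chosen for the example.

```python
def mt_is_worthwhile(post_edit_minutes: float, human_minutes: float) -> bool:
    """True when post-editing MT output is faster than translating from scratch."""
    return post_edit_minutes < human_minutes


def time_saved_fraction(post_edit_minutes: float, human_minutes: float) -> float:
    """Share of the human translation time saved by post-editing (negative if slower)."""
    if human_minutes <= 0:
        raise ValueError("human_minutes must be positive")
    return (human_minutes - post_edit_minutes) / human_minutes


# Hypothetical timings for one document, in minutes.
baseline = 15.0   # assumed time for a translator working without MT
post_edit = 9.0   # assumed time to post-edit the MT output

print(mt_is_worthwhile(post_edit, baseline))
print(round(time_saved_fraction(post_edit, baseline), 2))
```

Measured over successive retraining rounds, the saved fraction would be expected to rise if post-edited output really does improve the engine, which gives a simple way to track the improvement cycle.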

"When I talk MT quality output to clients, some say ‘It takes my translator 15 minutes. If it takes less time with your system, the quality is good; otherwise it is bad. ." In the case of the company's work on patents, the capacity to leverage a 10 million word corpus means that eventually the output only needs very light post-editing.

Apart from general ingrained skepticism about MT in the public mind - and among linguists - Hardt notes that potential end users usually understand the logic of recycling their existing language data for an automated solution.

"People are understandably concerned about questions of data security. But one cannot reconstruct a complete text from the data in an SMT system, so the data remains confidential. But any effort to share data will help prime the pump for more MT throughput."

The TAUS Take:

Languagelens is a good example of the rapid commercialization of an SMT system customized to a data-rich domain. It is proving far more cost-effective and productive than the (admittedly very old) rule-based system previously implemented for the same task. The developers themselves were surprised by the speed and ease with which an MT system could be put together and harnessed to a translation process. Expect to see many more such deployments competing for new language pairs in vertical markets.

 
