The Danish Languagelens system is a statistical MT engine that began as an academic project two years ago and now drives millions of words of English-to-Danish patent translation at the Copenhagen-based LSP Lingtech. Theoretical linguist Daniel Hardt now supervises development at Languagelens.
"The bottom line to the rise of SMT is Moore's Law. As long as computing power grows exponentially, anything linked to it - such as vast amounts of data - will expand in the same way," says Hardt. Languagelens grew out of a research project on integrating SMT with linguistic knowledge, and has been operating commercially since 2007.
Under its current business model, Languagelens licenses the technology to Lingtech, which has a strong track record in rule-based MT, and to CorpusMT, a new language technology company, and is gradually building up a client base in partnership with them.
"Our competitive edge in this field will be in customizing the engine to the special needs of clients who already have 10 million plus words of parallel language data. As soon as you enter a subject matter domain, the data will explode as post-edited output is recycled as automatic training content. Data and quality will always improve over time."
For Daniel Hardt, the key to success with SMT is first to get above the quality threshold below which post-editing takes longer than human translation. From there, the inherent quality improvement cycle, based on retraining with post-edited content, will continually reduce the post-editing effort. He notes that this is not something that the Google MT service can deliver.
"When I talk MT quality output to clients, some say ‘It takes my translator 15 minutes. If it takes less time with your system, the quality is good; otherwise it is bad. ." In the case of the company's work on patents, the capacity to leverage a 10 million word corpus means that eventually the output only needs very light post-editing.
Apart from general ingrained skepticism about MT in the public mind - and among linguists - Hardt notes that potential end users usually understand the logic of recycling their existing language data for an automated solution.
"People are understandably concerned about questions of data security. But one cannot reconstruct a complete text from the data in an SMT system, so the data remains confidential. But any effort to share data will help prime the pump for more MT throughput."
The TAUS Take:
Languagelens is a good example of the rapid commercialization of an SMT system customized to a data-rich domain. It is proving far more cost-effective and productive than the (admittedly very old) rule-based system previously implemented for the same task. The developers themselves were surprised by the speed and ease with which an MT system could be put together and harnessed to a translation process. Expect to see many more such deployments competing for new language pairs in vertical markets.




