TAUS - Enabling better automated translation

Friday
Jul 30th

The New SMT Player Comes from Thailand

Interview with Dion Wiggins, CEO of Asia Online

What would Language Weaver say about a new SMT engine delivering 440 language pairs, due to be rolled out in the next 18 months from an office in Bangkok?

This is the highly serious business plan announced by Dion Wiggins, CEO of Asia Online, at the LISA Conference in Beijing this week. Currently in the process of launching a portal focussed on Thai language content, Asia Online aims to become the master content owner of South East Asian language content, and purveyor of the largest SMT engine project the world has yet known.

Asia Online's line-up comprises experienced alumni from Gartner Asia Pacific, with a long track record in Asian IT and business, as well as a former senior legal and intellectual property expert with a CompuServe, Yahoo!, Lycos and MSN heritage. The company is financed by a mix of its own and VC funding, with an IPO scheduled for 2011.

One of Asia Online's founders and a shareholder is Philipp Koehn, a well-known academic specialist in SMT from Edinburgh University, who has worked with and on the Pharaoh and Moses open-source SMT systems among others. He is supervising the R&D work on the SMT engine.

The Asian market is an obvious target for MT, explains Wiggins. Although it has a high web usage rate, it suffers from very low penetration, largely due to a desperate lack of local language content. He reckons Asia will source the next billion web users, now that Western web markets are reaching saturation point.

However, Wiggins claims that there are only around 10 million pages online in South East Asian languages such as Thai, Tagalog, Bahasa Indonesia, and Vietnamese. And because there are no vast repositories of parallel texts in translation, it is far harder to kick-start translation using bitexts. "The barrier to entry for translation technology is rising, not falling."

A further concern is Wiggins' contention that the key brake on extensive SMT development today is that 70% of the data being harvested from the web is 'dirty data'. The web is being 'polluted' because users are translating their content with existing MT tools and republishing the output on the web verbatim, without proofing or cleaning. "Soon the internet as a source of language data will be non-viable," he says.

A third issue to be solved is how best to combine statistical data with syntax tree support to generate acceptable output in languages such as Chinese. Wiggins claims that adding a syntax component to an SMT system "seriously degrades throughput performance from 5,000 words a minute to only around 300" on a machine with four high-speed CPUs. This will require further optimization and is more than two years away from reaching commercial-grade performance.

Asia Online's solution is to start from scratch, entering carefully controlled clean data (often from books, through a network of agreements with large print publishers). The company claims a staff of 150 currently at work inputting and processing Thai and other language texts. The plan is to start with a baseline of 10 million sentence pairs and scale up to around 150 million through repeated cycles of generation and quality control, so that a new SMT engine can be built from these data.

The near-term plan (4-5 months) is to offer an English-Thai/Hindi/Bahasa Malay/Bahasa Indonesia deck, backed by a variety of translation services. Then, over 18 months, come the famous 22 European languages and 11 Asian languages in all combinations via a portal, accompanied by a centralized web service entry point for quality enterprise translations in conjunction with language service providers. Services will include web-based proofing and, more interestingly, "a real-time user-controlled learning system, so that once a translation error has been corrected, the correction is instantly available to the SMT system and the error is never made again."

Only time will tell whether the industry will have to make what Mark Tapling would call a "conservative estimate" of Asia Online's chances of a breakthrough in this relatively new SMT geography.
 
