TAUS - Enabling better translation

Wednesday
Feb 22nd

Machine translation and Asian languages




Sustained healthy rates of economic growth in many parts of Asia are helping to swell the middle classes in the region. We can expect to see rising levels of demand for translation into and across the region’s languages for some time to come.

It’s unlikely that we humans alone will have the capacity to satisfy such demand. Machine translation, with all its inadequacies, will undoubtedly play a pivotal role in aiding communication and fueling cross-border trade.

There are already a number of well-established Asian MT companies, including Asia Online, CCID, East Linden, Kodensha, Fujitsu and Toshiba. Add to these European firms such as Pangeanic and Applied Language Solutions, and the innovative business models of more recent arrivals Straker Software and Precision Translation Tools.

We can expect a wonderfully exciting and competitive period ahead for these and others with MT offerings. Notably, five of the firms mentioned above have the open source toolkit Moses at the heart of their technology stack.

The growing weight of Baidu, the leading Chinese search engine, is one more thing that should not be overlooked in the context of the Asian MT market. Baidu follows an approach similar to Google’s, oriented toward the steady development of data-driven NLP techniques and distributed computing.

With Baidu Box we also may be witnessing the birth of one of the most efficient Chinese products aimed at attracting and retaining users.

But any computational linguist will tell you that there’s a long list of improvements needed to make MT a reliable utility for Asian languages.

Thankfully, Asian institutions have recently made tremendous progress in natural language processing. This has been made possible by the increased availability of funds in some Asian countries, as well as a growing consensus that statistical data-driven technologies coupled with traditional linguistic methods are here to stay – even for Asian languages such as Chinese, Japanese and Korean.

One indication of this trend is the growing share of papers in the ACL [1] proceedings written by researchers based in Asia (i.e. China, Hong Kong, Korea, India and Singapore): from 15% in 2007 to 28% in 2011.

In November 2011 I had the good fortune to participate in and speak at the IJCNLP 2011 conference, one of the most significant events in natural language processing. IJCNLP provided an opportunity to observe the convergence of MT-related research and other cross-language technologies coming from China, India, South Korea, South-East Asian countries and even from Qatar.

I was happy to see the growing number of papers dealing with different methods to increase the coverage of MT engines for Asian languages. One of the most significant challenges for language pairs that do not involve English is the lack of high-quality parallel corpora. To tackle this issue, Asian researchers are investigating a number of strategies for artificially acquiring additional parallel corpora for MT training. The three main approaches are:

1. mining the web to create bilingual corpora;
2. manufacturing data using paraphrasing techniques; and
3. using pivot languages to artificially generate additional parallel data.

The first and most obvious way to overcome (or at least to smooth over) the coverage problem of SMT is to harvest parallel or comparable corpora from the Internet. The paper written by Hong Kong-based researchers (Simon Shi, Pascale Fung, Emmanuel Prochasson, Chi-kiu Lo and Dekai Wu) outlines a parallel document mining system that “improves mining recall by going beyond URL matching to find parallel documents from non-parallel sites”. It was surprising and pleasing to learn that the translation performance of an SMT system that uses the extracted parallel sentences as part of its training corpus is very much comparable with the translation quality obtained with a Moses-based system employing a manually translated corpus (on the basis of a corpus of ≈4M sentence pairs).
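To make the "URL matching" baseline mentioned in the quote concrete, here is a minimal sketch of how candidate parallel pages can be paired by their URLs alone. It is not the system from the paper: the language markers (`en`, `zh`), the URL patterns and the function names are all assumptions chosen for illustration, and real sites use many other conventions.

```python
import re
from collections import defaultdict

# Hypothetical language markers; real sites use many other conventions.
LANG_MARKERS = ("en", "zh")
# A marker counts only when it stands alone between URL separators.
_MARKER_RE = re.compile(r"(?<=[/_.\-=])(%s)(?=[/_.\-]|$)" % "|".join(LANG_MARKERS))


def url_key(url):
    """Return (language-neutral key, language), or None if no marker is found."""
    match = _MARKER_RE.search(url)
    if match is None:
        return None
    # Replace the first language marker with a placeholder.
    key = url[:match.start()] + "*" + url[match.end():]
    return key, match.group(1)


def pair_by_url(urls, source="en", target="zh"):
    """Pair URLs that differ only in their language marker."""
    groups = defaultdict(dict)
    for url in urls:
        keyed = url_key(url)
        if keyed is not None:
            key, lang = keyed
            groups[key][lang] = url
    return sorted(
        (g[source], g[target])
        for g in groups.values()
        if source in g and target in g
    )
```

Pages that have no counterpart under the same URL pattern are simply missed by this baseline, which is the recall gap that mining documents from non-parallel sites is meant to close.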

One of the papers that stood out in the context of the paraphrasing approach is a joint work of Harbin Institute of Technology and Baidu (Wei He, Shiqi Zhao, Haifeng Wang, Ting Liu). The paper describes the application of paraphrasing techniques to enrich parallel corpora used in SMT. The approach, first proposed by Callison-Burch (Johns Hopkins University) in 2006, is extended with a sentence novelty feature that helps to select the most novel paraphrase hypotheses to add to the parallel corpus. The improvement achievable by applying this algorithm is about 1 BLEU point for an 8M-token English-to-Chinese translation task. This gain is not significant for industry, but it will definitely stimulate further research on paraphrasing techniques.
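The idea of preferring "novel" paraphrases can be illustrated with a small sketch. The scoring below is an assumption made for illustration (the share of a candidate's word bigrams not yet present in the corpus), not the sentence novelty feature defined in the paper, and all function names are hypothetical.

```python
def ngrams(tokens, n):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novelty(candidate, corpus_ngrams, n=2):
    """Fraction of the candidate's n-grams not yet seen in the corpus."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    return len(cand - corpus_ngrams) / len(cand)


def select_novel(candidates, corpus, k=1, n=2):
    """Return the k paraphrase candidates that add the most unseen n-grams."""
    seen = set()
    for sentence in corpus:
        seen |= ngrams(sentence.split(), n)
    ranked = sorted(candidates, key=lambda c: novelty(c, seen, n), reverse=True)
    return ranked[:k]
```

With a corpus containing "the cat sat on the mat", the candidate "a feline rested on the carpet" would be ranked above "the cat sat on the rug", because it contributes more wording the engine has not seen before.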

An alternative to paraphrasing is the use of pivot languages to craft synthetic training corpora. This is a key technique for many language pairs with scarce resources. It’s also important to know that for the majority of translation scenarios that involve Asian languages, and in particular translation from and into Chinese or Japanese, much more language data is needed than, say, for translation from Spanish into English.

Two papers resulting from collaboration between I2R (Singapore) and Spanish and Chinese researchers (Marta R. Costa-jussà, Carlos Henríquez and Rafael Banchs; Ming Zhang, Xiangyu Duan, Ming Liu, Yunqing Xia and Haizhou Li) demonstrate the potential of a pivot combination strategy for Chinese-Spanish and Chinese-Japanese translation when English is used as a pivot language. In both papers, the results reveal almost identical translation quality delivered by the direct translation system and the one using English as a pivot. Finally, Michael Paul and Eiichiro Sumita (NICT, Japan) gave a great talk on factors that should be considered when selecting a pivot language and investigating its impact on the performance delivered.
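The simplest form of the pivot idea can be sketched as a join of two parallel corpora on their shared English side. This is only an illustration under stated assumptions: it is not the combination strategy used in the papers above, the function names are hypothetical, and exact matching on the pivot sentence is rare in practice, where a pivot-to-target MT system would typically translate the English side instead.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so near-identical pivots match."""
    return " ".join(text.lower().split())


def pivot_corpus(src_pivot, pivot_tgt):
    """Join (source, pivot) and (pivot, target) pairs on the pivot sentence."""
    by_pivot = defaultdict(list)
    for pivot, target in pivot_tgt:
        by_pivot[normalize(pivot)].append(target)
    synthetic = []
    for source, pivot in src_pivot:
        # Every target sharing this pivot sentence yields a synthetic pair.
        for target in by_pivot.get(normalize(pivot), []):
            synthetic.append((source, target))
    return synthetic
```

For example, a Chinese-English pair ("你好", "Hello") and an English-Spanish pair ("hello", "Hola") would produce the synthetic Chinese-Spanish pair ("你好", "Hola").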

I was tremendously motivated by the people I met at IJCNLP and the exciting research being done on Asian languages. Almost as soon as I got back to Amsterdam I put on my TAUS Labs hat and started building English<>Chinese MT engines using TAUS Data as the source for parallel corpora. At the time of writing we’ve built 32 different engines using various data combinations, segmenters and word re-ordering techniques and are looking forward to publishing findings in the next few weeks.

  1. The Annual Meeting of the Association for Computational Linguistics, the leading world-scale conference in computational linguistics and machine translation.
 

Comments  

 
+1 #2 Maxim Khalilov 2012-01-24 16:48
Hi Kirill,

Thanks for your questions.

In the majority of cases, the two sides of a bilingual text (called a parallel corpus) collected from the web are not accurate translations of each other. Web crawling and spidering are the automated steps in creating a corpus. The challenge is how to convert this raw data into a format suitable for an MT engine, including cleaning it. The quality control step of cleaning involves a lot of manual effort. Most corpora for training engines are upwards of a million words. This effort will hopefully pay off, since you will be able to create a language dataset which matches your own goals.

Modern MT is highly domain-dependent. In the case of data-driven systems, this means feeding your engine lots of pre-translated data that is as close as possible in terms of subject matter to the documents you are going to translate in the future. It also means you are unlikely to get good results translating technical manuals using an engine trained on, say, fiction books.

Apart from TAUS Data (www.tausdata.org) there are no other major open sources for accessing bilingual corpora. Hence the MT community is inevitably busy working on ways to collect parallel (and monolingual) data from the web.
 
 
0 #1 Kirill 2012-01-19 18:30
A question from someone far-far away from the field.

Is the net the primary "mine" for creation of bilingual corpora? I assume there can be quite some texts on the web, where translation is inaccurate for whatever reason. To prevent the build-up of such inaccuracies, wouldn't it be more logical to compare texts in books?

Or am I missing the point?
 

