
Sustained healthy rates of economic growth in many parts of Asia are helping to swell the middle classes in the region. We can expect to see rising levels of demand for translation into and across the region’s languages for some time to come.
It’s unlikely that we humans alone will have the capacity to satisfy such demand. Machine translation, with all its inadequacies, will undoubtedly play a pivotal role in aiding communication and fueling cross-border trade.
There are already a number of well-established Asian MT companies, including Asia Online, CCID, East Linden, Kodensha, Fujitsu and Toshiba. Add to these European firms such as Pangeanic and Applied Language Solutions, and the innovative business models of more recent arrivals Straker Software and Precision Translation Tools.
We can expect a wonderfully exciting and competitive period ahead for these and others with MT offerings. Notably, five of the firms mentioned above have the open source toolkit Moses at the heart of their technology stack.
The growing weight of Baidu, the leading Chinese search engine, is one more thing that should not be missed in the context of the Asian MT market. Baidu follows an approach similar to Google’s, oriented toward the steady development of data-driven NLP techniques and distributed computing.
With Baidu Box we also may be witnessing the birth of one of the most efficient Chinese products aimed at attracting and retaining users.
But any computational linguist will tell you that there’s a long list of improvements needed to make MT a reliable utility for Asian languages.
Thankfully, Asian institutions have made tremendous progress in natural language processing recently. This has been made possible by the increased availability of funds in some Asian countries, as well as a growing consensus that statistical data-driven technologies coupled with traditional linguistic methods are here to stay – even for Asian languages such as Chinese, Japanese and Korean.
One indication of this trend is the share of papers in the ACL[1] proceedings written by researchers based in Asia (i.e. China, Hong Kong, Korea, India and Singapore), which grew from 15% in 2007 to 28% in 2011.
In November 2011 I had the good fortune to participate in and speak at the IJCNLP 2011 conference, one of the most significant events in natural language processing. IJCNLP provided an opportunity to observe the convergence of MT-related research and other cross-language technologies coming from China, India, South Korea, South-East Asian countries and even from Qatar.
I was happy to see the growing number of papers dealing with different methods to increase the coverage of MT engines for Asian languages. One of the most significant challenges for language pairs that do not involve English is the lack of high-quality parallel corpora. To tackle this issue, Asian researchers are investigating a number of synthetic data-based strategies to artificially acquire additional parallel corpora for MT training. The three main approaches are:
1. mining the web to create bilingual corpora;
2. manufacturing data using paraphrasing techniques; and
3. using pivot languages to artificially generate additional parallel data.
The first and most obvious way to overcome (or at least to smooth over) the coverage problem of SMT is to harvest parallel or comparable corpora from the Internet. The paper written by Hong Kong-based researchers (Simon Shi, Pascale Fung, Emmanuel Prochasson, Chi-kiu Lo and Dekai Wu) outlines a parallel document mining system that “improves mining recall by going beyond URL matching to find parallel documents from non-parallel sites”. It was surprising and pleasing to learn that the translation performance of an SMT system that uses the extracted parallel sentences as part of its training corpus is very much comparable with the translation quality obtained with a Moses-based system employing a manually translated corpus (based on a corpus of ≈4M sentence pairs).
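To make the "URL matching" baseline mentioned above concrete, here is a minimal sketch in Python. It is not the mining system from the paper (which goes beyond this step); it only pairs pages whose URLs differ by a language marker. The marker lists, the function names `normalize` and `pair_by_url`, and the example URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical language markers; real sites use many other conventions.
LANG_MARKERS = {"en": ["en", "eng", "english"], "zh": ["zh", "cn", "chinese"]}

def normalize(url):
    """Return (language, URL with its language marker replaced by '*'), or None."""
    for lang, markers in LANG_MARKERS.items():
        for marker in markers:
            # Match the marker only when delimited, e.g. /en/ or lang=en
            pattern = r"(?<=[/=_.-])" + re.escape(marker) + r"(?=[/_.&-]|$)"
            if re.search(pattern, url):
                return lang, re.sub(pattern, "*", url, count=1)
    return None

def pair_by_url(urls):
    """Pair English and Chinese pages that share the same normalized URL."""
    buckets = defaultdict(dict)
    for url in urls:
        result = normalize(url)
        if result:
            lang, key = result
            buckets[key][lang] = url
    return sorted((b["en"], b["zh"]) for b in buckets.values()
                  if "en" in b and "zh" in b)

urls = [
    "http://example.com/en/news/1.html",
    "http://example.com/zh/news/1.html",
    "http://example.com/en/about.html",   # no Chinese counterpart
    "http://example.com/contact.html",    # no language marker
]
pairs = pair_by_url(urls)
```

Pages paired this way are only candidates: their content still has to be checked and sentence-aligned before it can serve as training data.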
One of the papers that stood out in the context of the paraphrasing approach is a joint work by Harbin Institute of Technology and Baidu (Wei He, Shiqi Zhao, Haifeng Wang, Ting Liu). The paper describes the application of paraphrasing techniques to enrich parallel corpora used in SMT. The approach, first proposed by Callison-Burch (Johns Hopkins University) in 2006, is extended with a sentence novelty feature that helps to select the most novel paraphrase hypotheses to add to the parallel corpus. The improvement achievable with this algorithm is about 1 BLEU point on an 8M-token English-to-Chinese translation task. This gain is not significant for industry, but it will definitely stimulate further research on paraphrasing techniques.
An alternative to paraphrasing is the use of pivot languages to craft synthetic training corpora. This is a key technique for many language pairs with scarce resources. It’s also important to know that for the majority of translation scenarios that involve Asian languages, and in particular translation from and into Chinese or Japanese, much more language data is needed than, say, for translation from Spanish into English.
Two papers resulting from collaboration between I2R (Singapore) and Spanish and Chinese researchers (Marta R. Costa-jussà, Carlos Henríquez and Rafael Banchs; Ming Zhang, Xiangyu Duan, Ming Liu, Yunqing Xia and Haizhou Li) demonstrate the potential of a pivot combination strategy for Chinese-Spanish and Chinese-Japanese translation when English is used as a pivot language. In both papers, the results reveal almost identical translation quality delivered by the direct translation system and the one using English as a pivot. Finally, Michael Paul and Eiichiro Sumita (NICT, Japan) gave a great talk on the factors that should be considered when selecting a pivot language and on its impact on translation performance.
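One common way to use a pivot language, often called triangulation, combines a source-to-pivot and a pivot-to-target phrase table by summing over the pivot phrases: p(t|s) = Σ_e p(e|s) · p(t|e). The sketch below illustrates that idea in Python. It is a related pivot technique, not the synthetic-corpus or combination strategies evaluated in the papers above, and the function name `triangulate` and the toy probabilities are made up for illustration.

```python
from collections import defaultdict

def triangulate(src_to_pivot, pivot_to_tgt):
    """Combine two phrase tables via p(t|s) = sum over pivot phrases e of p(e|s) * p(t|e)."""
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_to_pivot.items():
        for e, p_e_given_s in pivots.items():
            # Pivot phrases missing from the second table contribute nothing
            for t, p_t_given_e in pivot_to_tgt.get(e, {}).items():
                src_to_tgt[s][t] += p_e_given_s * p_t_given_e
    return {s: dict(ts) for s, ts in src_to_tgt.items()}

# Toy Chinese->English and English->Spanish tables (made-up probabilities)
zh_en = {"猫": {"cat": 0.9, "kitty": 0.1}}
en_es = {"cat": {"gato": 1.0}, "kitty": {"gatito": 0.8, "gato": 0.2}}

zh_es = triangulate(zh_en, en_es)
```

In this toy case "猫" ends up mapping mostly to "gato", with a small share for "gatito". Any pivot phrase absent from the second table drops out, which is one reason the choice of pivot language matters.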
I was tremendously motivated by the people I met at IJCNLP and the exciting research being done on Asian languages. Almost as soon as I got back to Amsterdam I put on my TAUS Labs hat and started building English<>Chinese MT engines using TAUS Data as the source for parallel corpora. At the time of writing we’ve built 32 different engines using various data combinations, segmenters and word re-ordering techniques, and are looking forward to publishing our findings in the next few weeks.
[1] The Annual Meeting of the Association for Computational Linguistics, the leading international conference in computational linguistics and machine translation.

Comments
Thanks for your questions.
In the majority of cases, the two sides of a bilingual text (called a parallel corpus) collected from the web are not accurate translations of each other. Web crawling and spidering are the automated steps in creating a corpus. The challenge is how to convert this raw data into a format suitable for an MT engine, which includes cleaning it. The quality control step of cleaning involves a lot of manual effort. Most corpora for training engines run to upwards of a million words. Hopefully this effort pays off, since you will be able to create a language dataset that matches your own goals.
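Part of the cleaning step described above can be automated. The following is a minimal Python sketch of a few typical filters: dropping empty or overlong segments, pairs with implausible length ratios, and exact duplicates. The function name `clean_parallel` and the thresholds are assumptions for illustration, and the word counts assume whitespace-tokenized text (Chinese would need a segmenter first). It does not replace manual quality control.

```python
def clean_parallel(pairs, min_len=1, max_len=80, max_ratio=3.0):
    """Filter (source, target) sentence pairs with simple heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop empty or overlong segments (min_len >= 1 also guards the division below)
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Drop pairs whose lengths differ too much to be plausible translations
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicates
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

raw = [
    ("the cat sleeps", "el gato duerme"),
    ("the cat sleeps", "el gato duerme"),       # duplicate
    ("hello", ""),                              # empty target
    ("yes", "sí claro que sí por supuesto"),    # length ratio too high
]
cleaned = clean_parallel(raw)
```

Only the first pair survives here. Filters like these catch obvious noise, but they cannot tell whether the two sides are actually accurate translations of each other.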
Modern MT is highly domain-dependent. For data-driven systems, this means you should feed your engine lots of pre-translated data that is as close as possible in subject matter to the documents you are going to translate in the future. You are unlikely to get good results translating technical manuals with an engine trained on, say, fiction books.
Apart from TAUS Data (www.tausdata.org), there are no other major open sources of bilingual corpora. Hence the MT community is inevitably busy working on ways to collect parallel (and monolingual) data from the web.
Is the net the primary "mine" for the creation of bilingual corpora? I assume there can be quite a few texts on the web where the translation is inaccurate for whatever reason. To prevent the build-up of such inaccuracies, wouldn't it be more logical to compare texts in books?
Or am I missing the point?