
There’s a rapidly growing range of machine translation solutions. Lionbridge and IBM recently entered the market. SDL has strengthened its hand with the acquisition of Language Weaver. Systran, PROMT and other stalwarts have stepped up their games. And middle-tier language service providers are developing their own engines. Some, such as Applied Language Solutions and Pangeanic, are already competing with the likes of IBM and SDL.
So it’s clear you can build and customize systems using vendors or even do it in-house. But how do you assess how well these alternatives perform, whether they are up to scratch, and whether they improve over time with customization and further development?
You need concrete measures to make informed decisions on investments, to calculate ROIs, and to quantify the effectiveness of the alternatives you are considering.
If you are a language service provider or enterprise executive considering MT for the first time, it’s important that you have an awareness of the main distinctions, approaches and hazards. This article is a quick primer to get you started.
Evaluating MT systems is different than human translation quality control
You are probably familiar with applying quality control processes to the translations that you produce (or consume). In fact, the industry has standards that most reputable service providers follow rigorously. Every input segment must be translated correctly. You care about the amount of editing and corrections that your editors and proof-readers make and their quality implications. It's likely that you will be tempted to apply this same methodology when evaluating MT output.
This makes sense in some situations, but not in others. The familiar process is well suited to assessing and ensuring the quality of the final human-edited translations when MT is used as a step along the way. It is less effective for assessing the quality of MT in isolation: most MT output requires some level of human post-editing, and errors often vary widely from segment to segment.
The error profiles exhibited by MT systems are also very different than those typical of human translators. Consequently, the quality questions you should be focused on when assessing MT are likely quite different. They will often center on whether your MT-modified process leads to overall productivity gains, while maintaining, or even improving, the resulting end quality of your translations.
It is in your interest to learn about the main well-established automated and human measures available for assessing MT output, and to match these with your own goals.
Benchmark evaluation of MT systems is different from MT quality assessment during system operation
You are likely to be interested in assessing the performance of MT systems in at least two very different ways, and the distinction between them is critical. If you are assessing MT technology alternatives, you will want to conduct meaningful benchmark tests, where the MT engines are tested on well-designed test suites. This is an "offline" evaluation scenario. You should spend the necessary time to design and establish appropriate benchmark data, and test your alternative MT engines in depth on this data.
Most critically, you should extract or create high-quality target human translations for your benchmark test suites. The most commonly used automated evaluation metrics require such "reference" translations, and many human evaluation measures are also based on comparing the MT output to a correct (human) reference translation.
Once an MT engine is deployed, a different quality question should be on your mind. Since MT systems often vary widely in translation quality from segment to segment, you are likely to be interested in whether an MT system can assess its own quality during runtime operation, and perhaps flag or filter out poor translations. This too can be an important consideration in selecting between MT engine alternatives.
But you should note that this runtime quality assessment scenario is quite different. In particular, any "quality confidence" scores provided by the MT engine cannot possibly depend on the availability of a "reference" translation, since at runtime, such a reference is surely unavailable.
Runtime quality confidence scores are inherently designed differently than benchmark testing measures. Nevertheless, if detection of poor translations is an important feature for you, you should investigate whether the MT engines you are considering have such a capability, and compare the effectiveness of their confidence estimates, as well as profile engine performance using the confidence scores provided by the engines.
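One way to profile an engine using its confidence scores is to run it on benchmark data where quality is already known, then check how coverage and average quality change as you raise a confidence threshold. The sketch below assumes each segment is represented by a hypothetical (confidence, quality) pair, with both values on a 0 to 1 scale. The function name and the data layout are illustrative and do not correspond to any particular engine's output format.

```python
def profile_confidence(segments, thresholds):
    """Profile an engine's confidence scores on benchmark data.

    segments: list of (confidence, quality) pairs, one per segment.
              Both values are assumed to lie between 0 and 1.
    Returns one (threshold, coverage, mean_quality) row per threshold,
    where coverage is the fraction of segments kept at that threshold.
    """
    rows = []
    for t in thresholds:
        # Keep only the segments the engine is at least this confident about
        kept = [quality for confidence, quality in segments if confidence >= t]
        coverage = len(kept) / len(segments) if segments else 0.0
        mean_quality = sum(kept) / len(kept) if kept else None
        rows.append((t, coverage, mean_quality))
    return rows


# Hypothetical benchmark results: (engine confidence, measured quality)
example = [(0.9, 0.8), (0.2, 0.1), (0.6, 0.5), (0.4, 0.3)]
for row in profile_confidence(example, [0.0, 0.5, 1.0]):
    print(row)
```

If the confidence estimates are useful, mean quality should rise as the threshold rises, at the cost of lower coverage. Comparing these curves across engines is one way to compare the effectiveness of their confidence estimates.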
Automated metrics such as BLEU and METEOR do not immediately translate into productivity gains or ROI figures, but are extremely useful for benchmark comparisons
If you've been reading about MT, you've probably heard of BLEU, a commonly used automated metric for assessing MT system performance. BLEU and other commonly used automated measures such as METEOR and TER calculate a score for an MT system on a particular benchmark test set, by comparing the translations produced by the MT system to a human "reference" translation.
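To make the idea of comparing MT output to a reference concrete, here is a minimal sketch of a BLEU-style score. It is a simplification, not the official implementation: it assumes one reference per segment, whitespace tokenization, and no smoothing, so its numbers will not match those of standard tools. The function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU: one reference per segment, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with no matches gives a score of 0
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


print(corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"]))
```

The score rewards word sequences shared with the reference and penalizes overly short output. For real evaluations, use an established implementation rather than a sketch like this one.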
While the scores have some correlation with concrete measures, such as productivity gains and/or ROI, this relationship is often not straightforward, making it quite difficult to interpret these scores directly. Nevertheless, automated metric scores are extremely useful for comparing alternative systems on the same benchmark data-set, and for contrasting two versions of a system (e.g. before and after customization). They also serve as critical scoring functions for tuning statistical MT systems for optimal performance.
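When two systems, or two versions of one system, score close together on the same benchmark, it helps to check whether the difference is stable or an accident of the particular test segments. One common technique, named here as an illustration and not as something the metrics themselves provide, is paired bootstrap resampling. The sketch below assumes you already have a per-segment score for each system on the same segments. Averaging segment scores is only an approximation of corpus-level metrics such as BLEU.

```python
import random


def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system B outscores system A.

    scores_a, scores_b: per-segment metric scores for the same segments,
    in the same order. Higher scores are assumed to be better.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        # Draw n segments with replacement; use the same draw for both systems
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / samples


# Hypothetical per-segment scores before (a) and after (b) customization
before = [0.31, 0.42, 0.28, 0.55, 0.40]
after = [0.35, 0.41, 0.33, 0.58, 0.46]
print(paired_bootstrap(before, after))
```

A value close to 1.0 suggests the second system is consistently better on this benchmark. A value near 0.5 suggests the difference could easily be due to the choice of test segments.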
Human assessments of MT output are costly and somewhat difficult to execute, but can result in the most meaningful types of analyses
If you need more detailed analysis of performance than the aggregate or segment-level scores provided by the automated metrics, you will likely want to perform a human evaluation of MT output. These types of evaluations can rate quality on an absolute or relative scale, can rank two or more translations produced by alternative systems on the same data, and can profile error characteristics exhibited by the MT system.
Such human evaluations are, however, costly, and great attention should be paid to designing the evaluation task in ways that promote high levels of agreement across and within judges. Coarser, well-defined quality scales, clear instructions and easy-to-use user interfaces are all critical factors in producing statistically meaningful results. Attention should also be given to the required qualifications of the human judges.
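Agreement between judges can be quantified. One widely used statistic for two judges is Cohen's kappa, which corrects the raw agreement rate for the agreement expected by chance. The sketch below is a minimal version under the assumption that both judges rated the same segments on the same categorical scale. The function name and example ratings are illustrative.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two judges rating the same items on one scale."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both judges
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: based on how often each judge uses each label
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of eight MT segments on a coarse 1-3 quality scale
judge_1 = [1, 2, 3, 3, 2, 1, 2, 3]
judge_2 = [1, 2, 3, 2, 2, 1, 3, 3]
print(cohens_kappa(judge_1, judge_2))
```

A kappa of 1 means perfect agreement and a kappa of 0 means agreement no better than chance. Running the same calculation on one judge's repeated ratings of the same segments gives a rough check of within-judge consistency.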
Several open-source suites of evaluation tools are making it easier to use and leverage the collection of measures and tools outlined above
Conducting meaningful and effective evaluations of MT can be a challenge. While it's likely you will need to do quite a bit of background work before you can embark on running a serious evaluation, several open-source tool suites are now available that can make your life easier. The latest version of METEOR, for example, includes a suite of analysis and visualization tools for aggregating and displaying meaningful statistics after running the metric on a benchmark test set. Both Symantec and Asia Online have also developed open-source tools that make it easier to run various automated metrics, and to display and statistically analyze their results.
Alon Lavie is Associate Research Professor, Carnegie Mellon University, President, Safaba Translation Solutions, and President, Association for MT in the Americas (AMTA)




