Major translation buyers, language service providers and even commercial machine translation vendors are increasingly using open source MT solutions as part of their translation toolkit. By far the most widely used system is Moses, a statistical machine translation engine.
The report's author, Achim Ruopp, also leads a half-day TAUS workshop on this topic.
This report will equip you with the know-how to make informed decisions about how to implement open source MT solutions, with a focus on the Moses system. It covers the full cycle, from data preparation, through retuning your engine, to integration with your translation workflow.
As a translation provider or translation buyer you have valuable linguistic assets in the form of translation memories. Learn how to combine these assets with translation data from other sources to build highly productive statistical machine translation (SMT) systems.
The report provides you with background knowledge on SMT that will help you understand its strengths and weaknesses. Learn how to address the weaknesses with hands-on best practices.
The report also contains an overview of the open source ecosystem of Moses, which allows you to objectively evaluate how this MT system fits into your existing processes and how Moses compares to commercial solutions.
If you are an IT manager or budget holder who is considering MT, or is in the early stages of using it, in your mainstream translation business, this report is a must for you.
Contents
1. Why Open Source?
2. Overview of Open Source Machine Translation
2.1 Approaches to Machine Translation
2.2 Principles of Statistical Machine Translation
3. Data sources – making the most of your translation memories
3.1 Parallel corpus for translation model training
3.1.1 What is a parallel corpus?
3.1.2 Parallel corpus format
3.1.3 Parallel corpus sources
3.1.4 Defining the expected quality: tuning data and evaluation data
3.2 Monolingual corpus for language model training
4. The Moses statistical machine translation system and associated components
4.1 Moses and required 3rd party components
4.1.1 Moses system components
4.1.2 Operating system
4.1.3 Hardware requirements
4.2 Deploying Moses
4.2.1 Basic setup
4.2.2 Performance considerations and clustering
4.2.3 Moses on Amazon EC2
5. Training, tuning and translating with Moses
5.1 MT system training with Moses
5.1.1 Tokenization
5.1.2 Lowercasing and the recaser model
5.1.3 Training of language model and translation model
5.1.4 Tuning the MT system
5.2 Translating text with a trained Moses system
5.3 System Evaluation and Tweaking
5.4 Iterative improvement of the MT system
6. Best practices
6.1 Input file formats
6.2 Corpus cleaning
6.2.1 Basic corpus cleaning
6.2.2 Linguistic corpus cleaning
6.2.3 Handling named entities
6.3 Language specific considerations
6.3.1 Tokenization for East Asian languages
6.3.2 Morphologically rich languages
7. Integrating Moses into a localization workflow
7.1 Offline integration
7.2 Online integration
7.3 Postediting machine translations
7.3.1 Postediting environments
7.3.2 Postediting metrics
7.3.3 Productivity
7.3.4 Quality
7.3.5 Translator acceptance
8. Getting support and supporting the community
8.1 Getting support
8.2 Reporting bugs
8.3 Opportunities to contribute to and build upon Moses
8.3.1 Commercial support, supported releases and consulting services
8.3.2 Filling in the gaps for commercial use
8.4 Staffing recommendations
9. Resources
10. Bibliography
Useful Links
TAUS Reports Calendar 2010
TAUS Events Calendar 2010