A TAUS take on the non-proprietary MT landscape
Everyone knows that Moses is the most widely used open source MT system in the translation industry, but it is certainly not alone in the engine platform space. To ring in the New Year 2010, here is a rapid update of some of the better known open MT systems.
Apertium is a shallow-parser RBMT engine from Spain that is particularly useful for closely related language pairs such as the various Iberian tongues – Castilian, Valencian, Catalan, Aragonese, Galician and so on. It is actively developed by over 100 registered developers at SourceForge, and also by a broad community in Europe, mostly academic but also from language technology companies such as Prompsit.
The Apertium user base includes banks, governments, software companies and NGOs, and is most visible on a number of sites in Spain: the Catalan government uses it for Catalan-Spanish-Occitan, the La Voz de Galicia newspaper uses it to generate an online Galician version, and universities use it to translate materials for their multilingual student bodies. Anyone interested in developing MT systems between cognate languages such as Bulgarian and Macedonian, Persian and Tajik, or Hindi and Urdu might wish to investigate Apertium’s capabilities.
OpenLogos is a bit of a UFO in the MT skies and sightings are very rare. This is an open source version of the Logos RBMT system, originally developed in the 1970s to translate English to Vietnamese documentation for the US military. Since then it has morphed through numerous commercial identities in search of more investment, and has been tested and/or used by a broad range of end users. OpenLogos is a freely available version of the core technology, curated by the DFKI in Germany and available under a standard open source license. But is this rich, complex system actually usable?
This system is apparently being downloaded, and the mailing list has 230 subscribers, mainly from companies and institutions, possibly because it is the only large-scale MT system available on Linux. But building new language pairs requires large financial resources as well as engineer training, and the technology is now out of date and somewhat closed in on itself.
However, the system does have some valuable lexical resources for a number of languages that could be exploited for a superior RBMT engine. It also includes a very rich ontology, whose semantic information could be used to develop various linguistic applications.
For example, one Portuguese language engineer has been working with OpenLogos syntactic and semantic resources to build a “ReWriter” program to control the quality of language-based information. Basically it will consist of an interactive process of simplifying, cleaning up or otherwise normalizing text, especially with a view to MT.
Matxin is a Basque twin to Apertium, developed five years ago to handle the non-cognate languages of Spain as part of the government-funded OpenTrad project. A rule-based Castilian-to-Euskera engine came out first, and recently the Basque developer Eleka has released a statistics-based Euskera-to-Castilian engine.
As for Moses itself, the platform is going from strength to strength, and a number of LSPs and translation groups in large companies are exploring its possibilities as industrial-strength SMT technology. The aim of the mainly academic community developing Moses is to make it as open and easy as possible, with no sign-up or registration. At nearly 4,000 downloads in the past twelve months, it must be the most accessed MT software in the world today.
But what’s next? We can certainly look forward to more sophisticated syntactic capabilities as academic work is transferred to the new builds. Some people are working on multithreaded versions of Moses to take advantage of multicore CPUs; others are adapting engines to operate on mobile devices such as iPhones and netbooks.
Moses will also continue to play a key role in the large-scale EuroMatrix projects designed to spread rapid-development MT for hundreds of language-pairs right across the European linguistic space. The latest MT Marathon is being held in January in Dublin to explore some of the latest research from the Moses community.
More important for potential users will be the emergence of new service suppliers who can help bridge the gap between research and the translation industry, making Moses engines easier to train, deploy and maintain.
To sum up, the current trend seems to track what is happening among proprietary engines, namely growing hybridization. Rule-based systems are adding statistical capacity at various points, while SMT systems are extending the use of syntactic knowledge in the underlying data-driven approach.
There are of course a number of other fragments or prototypes of open MT systems – a quick scan of SourceForge will demonstrate what the open source community is doing in this field. But relevant open technology for MT users must constitute a genuine platform, not just an engine or a program. This means a technology ecosystem that enables new engines to be built, stable builds to be released, and appropriate maintenance, training, guidance and development. Check out the FLOSS manual on Open Translation Tools for exemplary coverage of the key points in this area.
Many thanks to Anabela Barreiro, Mikel Forcada, Hieu Hoang, Inaki Irazabalbeitia Fernandez, and Walter Kasper for sharing information and ideas.



