TAUS - Enabling better translation

Thursday
May 17th

How do scientists see the immediate future of translation automation?



TRANSLATION IN THE 21ST CENTURY

The 2000s may well prove to be the most productive decade for global machine translation research since the 1950s and early 1960s. Rumor had it that at that time some $20 million (over $120 million at today’s rates) were devoted to mechanical translation research in the US alone, before government funding was switched off around 1966 following the infamous ALPAC report.

Our industry needs dispassionate, unprejudiced research. On its own, it could never fund the depth of inquiry and breadth of trial-and-error testing required to improve systems and innovate with new models. We are all dependent on this surge of activity to push through to a new generation of marketplace solutions, which usually arrive years after the researchers who first invented them have moved on to new challenges. At the same time, this R&D landscape is changing.

In terms of public funding, DARPA continues to fund statistical MT programs in the United States, as does the European Union through the Seventh Framework Programme, the largest such program being the open-source-enabled EuroMatrixPlus project. Plenty of other academic MT research projects are under way at universities and research institutes from Europe to South Africa and from China to India. Major IT corporations such as IBM and Microsoft also continue to fund natural language processing in general and translation technology projects in particular.

R&D beyond the university

At the same time, much near-market research is also shifting from traditional academic environments or large-scale IT labs to the fast-track world of industrial innovation, as Google’s huge statistical translation effort demonstrates. Cheaper resources and the availability of open source tools are also seeding the emergence of nimble translation automation service partners, sometimes spun off from academic research departments, that carry out R&D for clients seeking faster technology fixes to real-world translation problems.

The Moses open source statistical MT toolkit, which is being widely tested by industry, is probably the most significant recent outcome of this concerted activity for the translation industry, and now a symbol of the influence of the data-driven paradigm in both scientific research and the business world. Indeed, the list of academic publications in English alone on statistical MT and related subjects is growing by leaps and bounds, reflecting a new wave of specialization and collaboration, and a welcome focus on sharing results.

Some of these research programs have their sights fixed on near-term prototyping for non-commercial targets in the fields of military intelligence (in the US) or information access for citizens (in the EU). Although the results of these ongoing SMT projects will almost certainly flow down to improve real-world MT processes more broadly, there is no clear model of how such benefits might reach the marketplace in an efficient, road-tested way.

One of the key areas for new research is determining how machine knowledge of syntax and semantics can enrich and empower the language models that currently underlie data-driven approaches. More researchers are then likely to return to looking at appropriate architectures of meaning annotation to feed knowledge-rich translation processes.

Overall, the existence of these multiple centers of research interest bodes well for the translation industry as a whole, despite the inevitable rash of false dawns and dead ends. The more people there are hypothesizing, testing, and selecting a critical path through alternative models of any aspect of translation, the more we will all ultimately benefit from the ‘fittest’ survivor. At the same time, research funding is finite, so practical benchmarks are needed to offer a competitive environment for testing the results of MT research at a pre-production stage.

To help see how researchers envision the future of translation automation, we asked a number of scientists about their own view of the next decade. Here are five areas where we may (or may not) expect interesting developments:

Language transparency and the rise of transient content

One key development in the strategic role of real-world translation will be the emergence of ‘language transparency’, another way of saying that (all) linguistic content will be inherently ‘translate-ready’. Users will be able to access content in their own language wherever it comes from, and any access platform will include translation automation by default, be it via a browser or any other application. The translation process for such content will occur invisibly, as a switch in the infrastructure.

This in turn will mean that most automated translation will involve ‘transient’ content interactions, such as chat, dynamic content on mobile networks, and social media streams. Such translation activities will be virtually free, will not require optimum quality, and will hence occur largely outside the orbit of the translation services industry.

In the interim, what we think of as mainstream high-quality translation requirements (for government, legal, product, strategic, high risk, branded content) will be translated in more or less the same way as today, using a mix of human / MT + post-editing / advanced leveraging.

The advances that will drive the language transparency of textual content will come not from any specific breakthrough in language technology but from infrastructure advances such as higher bandwidth, cloud computing resources, data sharing and intelligent data mining.

Data and resource sharing

Although TAUS Data Association (TDA) and other repositories such as the MyMemory and Google Translate content farms have been accumulating a huge stock of parallel language data, one critical issue in the immediate future will be to make these collections available to the scientists and others who need them to enrich their language models.

A further, more recent domain of choice is likely to be the still untapped recordings of bilingual spoken content (e.g. recordings of simultaneous and consecutive interpretations from meetings and conferences) that will help prime the development of real-time speech-to-speech translation. Part of the agenda for both academic and industrial R&D will therefore be to develop the kind of infrastructure that will make it easy to collect this material and make it available as a trustworthy research and production resource.

For production systems, it will be possible to be far more selective about the deployment of data resources. Users will be able to know precisely when very large quantities of data are relevant for a given translation automation task, and when a much more selective range of data will do the trick. In other words, there will be a trend to making data access and usage more intelligent.
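As a rough illustration of what more selective data usage could look like, here is a minimal Python sketch that filters a pool of parallel sentence pairs by how closely their source side matches a small in-domain sample. The function names, the vocabulary-overlap score, and the 0.5 threshold are all assumptions made for this example; none of them comes from the scientists quoted here, and real systems would use far more refined selection criteria.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenization; a real system would use a proper tokenizer."""
    return re.findall(r"\w+", text.lower())


def domain_vocabulary(in_domain_sentences):
    """Count word frequencies in a small sample of the target domain."""
    counts = Counter()
    for sentence in in_domain_sentences:
        counts.update(tokenize(sentence))
    return counts


def overlap_score(sentence, vocab):
    """Fraction of the sentence's tokens that also occur in the domain sample."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in vocab) / len(tokens)


def select_pairs(pool, in_domain_sentences, threshold=0.5):
    """Keep (source, target) pairs whose source side looks in-domain.

    The threshold is a hypothetical value chosen only for illustration.
    """
    vocab = domain_vocabulary(in_domain_sentences)
    return [pair for pair in pool if overlap_score(pair[0], vocab) >= threshold]


if __name__ == "__main__":
    sample = ["Press the power button to restart the printer."]
    pool = [
        ("Restart the printer to clear the error.",
         "Redémarrez l'imprimante pour effacer l'erreur."),
        ("The committee adopted the resolution.",
         "Le comité a adopté la résolution."),
    ]
    for source, target in select_pairs(pool, sample):
        print(source, "->", target)
```

Plain vocabulary overlap is inflated by common function words, so this is only a starting point; the aim is simply to show that a small, well-chosen subset can be picked out of a very large collection with an explicit relevance criterion.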

Impact of translation automation on the translator community

The general feeling among researchers is that translators will continue to play a central role in the production of high-quality translation well into the future. They will also inevitably contribute to the fine-tuning and repairing of MT output as post-editors, through the feedback loops that are vital to optimizing MT systems. The gradual buildup of post-edited texts will then turn into a huge body of potentially decisive training data for MT systems.
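To make the feedback loop concrete, the following Python sketch shows one way post-edited output could be logged as training data together with a simple measure of how much the post-editor changed. It uses a word-level edit distance normalized by the length of the post-edited text, loosely in the spirit of edit-rate metrics such as HTER but without block shifts. The record layout and function names are assumptions made for this example, not a description of any particular production workflow.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a word from the MT output
                current[j - 1] + 1,      # insert a word from the post-edit
                previous[j - 1] + cost,  # keep or substitute a word
            ))
        previous = current
    return previous[-1]


def post_edit_effort(mt_output, post_edited):
    """Edits per word of the post-edited text; 0.0 means nothing was changed."""
    reference_length = max(len(post_edited.split()), 1)
    return word_edit_distance(mt_output, post_edited) / reference_length


def build_training_data(records):
    """Turn (source, mt_output, post_edited) records into scored training pairs."""
    return [
        {"source": source, "target": post_edited,
         "effort": post_edit_effort(mt_output, post_edited)}
        for source, mt_output, post_edited in records
    ]


if __name__ == "__main__":
    records = [
        ("Le chat est sur le tapis.",
         "The cat is on mat.",
         "The cat is on the mat."),
    ]
    for entry in build_training_data(records):
        print(entry)
```

Storing the effort score alongside each pair would let a system builder see which segments the MT engine handles poorly, which is one plausible way the accumulated post-edits could feed back into system tuning.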

There will naturally be more research into ways in which this symbiotic relationship can be optimized within the various types of workflows, with improved toolsets for post-editors. But it seems unlikely that there will be anything more than incremental advances in performance for the industry as a whole. We can expect forward-looking technical translators to adopt new power tools emerging from such research to stay competitive.

Paradigm-changing R&D

Current wisdom has it that there is a small number of extremely hard problems to be solved for fully automatic translation, and a larger number of less intractable problems in MT that will be solved within the coming decade. The problems that require a theoretical breakthrough (or which turn out to be inherently unsolvable by artificial means) involve conceptual issues in computational linguistics rather than technology issues in real-world engineering environments.

The solvable problems are already on the R&D agenda. One is to optimize the handling of languages with complex morphologies or with non-Indo-European word orders, both of which typically make it hard to deliver smooth machine outputs for a number of language pairs. This type of system optimization will almost certainly involve adding annotations to the existing parallel data to help the machine learn more effectively.

As to the old fantasy of the perfect artificial translator, the hypothesis on the table is that a system capable of systematically aping (or even surpassing) a human translator will need to draw on ‘world models’ (real-world knowledge) to overcome the critical quality bottleneck. But it has so far proved impossible to program a machine to understand the semantic intentionality of a text.

Computers can of course be programmed to deploy knowledge of language or of statistical patterns of fluency or of linguistic rules, lexical data, or parallel content. But they cannot access a knowledge base that helps them decide correctly how to disambiguate a given expression in a plausible way in a given context.

Though some scientists will continue to examine different ways of automating more and more of the human translational capacity, most of the effort in this new wave of MT research activity is, as we have seen, set to focus on the practical results of automation technology.

Building on what has been called ‘the unreasonable effectiveness of data’, most MT scientists believe there is a need for much more abstract language models that can handle the immense complexity and context-sensitivity of linguistic objects, and then use the available data to improve the translation process.

In other words, the gradual accumulation of translation data from industry over the past thirty or so years will be put to work helping the scientists provide techniques for building better production translation systems in exchange. That sounds like a highly productive example of the culture of sharing.


CONTRIBUTORS

Many thanks to the following scientists for contributing their opinions to this article:

Christian Boitet, Université Joseph Fourier, Grenoble
Daniel Hardt, Copenhagen Business School and LanguageLens
Anthony Hartley, Leeds University
Kevin Knight, Information Sciences Institute and University of Southern California
Alon Lavie, Carnegie Mellon University and Safaba Translation Solutions
Joseph Mariani, University of Paris
Andrei Popescu-Belis, Idiap Research Institute, Martigny
Mark Seligman, Spoken Translation Inc.
Khalil Simaan, University of Amsterdam
Gregor Thurmair, Linguatec
Andy Way, Dublin City University and Applied Language Solutions




Русский (Translated by Logrus)

 
