Early MT
Although there has been a parallel tradition of academic research in computational linguistics (using computers to analyze and explore various aspects of human languages), the real engine driving the growth of language technology has been the search for economic or geostrategic advantage. The earliest attempts at translation technology in the 1950s focused on automating the translation of Russian to English technical material related to the space race and its military implications. Later, in the 1970s during the Vietnam War, there was a spurt of activity to develop an English to Vietnamese system to speed up the translation of weapons documentation. By the 1980s, when the consumer market for computing started to open up, interest switched to translating or adapting actual software products and their accompanying volumes of documentation into any language that offered a suitable market - a process we now call localization. We were on the cusp of an age of mass multilingual computing.
Even in those early, experimental days, the limitations of fully automated translation were quickly recognized and the only commercial system that succeeded in emerging from the many thousands of man-hours devoted to R&D was what is now known as Systran. When the localization boom began, and the range of target languages for translation exploded from English and a few major European languages to upwards of twenty or thirty languages and their variants (e.g. different types of French, Chinese, English and Spanish) worldwide, there was tremendous pressure to harness computer power to help the translation process.
However, due to the ‘natural’ complexity of human languages, there was no easily scalable solution to the translation automation challenge. Small parts of languages could be ‘computerized’ successfully - for example, dictionaries of words or expressions that could be looked up in two or more languages, or very constrained fields of sentence analysis (grammar) for very special text types. However, trying to extend this capability to much larger volumes of text in many more languages proved extremely difficult without laborious post-editing. For immediate needs, then, the better solution was for localization practitioners in the workplace to develop a judicious combination of computer and human translator power, sometimes called ‘computer-assisted translation’ or CAT. Database tools such as terminology management systems, and later translation memories, a vital step forward for the commercial growth of translation, were developed in the late 1980s and early 1990s. These offered effective productivity gains to translators, and greater control over process management for buyers of translation services. The ‘language industry’, with its network of products and services, suppliers and buyers, was born.
Europe's language barrier
In Europe at this time, recognition of the Community's multilinguality problem - 11 languages used on a theoretically equal basis among 15 nations, with another ten due to be added to reach today's 25 - spurred the launch of a massive translation automation project involving all possible language pairs. Called EUROTRA, this project was over-ambitious and never achieved its declared goal, but in-depth activity over six years did help kick-start a wide range of useful language technology programs in different countries, from Portugal to Denmark, and from Greece to the Netherlands.
This meant that basic tools - word analyzers (lemmatizers) and sentence parsers - were developed for many languages, large volumes of digital language resources (such as tagged texts and sometimes parallel corpora of translated texts) were created and standardized, and dictionaries that could be read by computers were compiled. So even if EUROTRA died an early death, language technology in Europe lived on to provide the language ‘infrastructure’ of researchers, consultants, practitioners and systems that drives multilingual communication today.
The United States was not, of course, faced by the same urgent need to inter-translate in multiple languages, so R&D there tended to focus more on such topics as advanced information searching, and also speech technology. Using basic but essential computer language tools such as thesauri (WordNet), primitive semantic networks, and shallow grammar parsers, the agenda included such issues as summarizing multiple documents in a single, brief but relevant text, or in a similar way, extracting specific information from vast streams of news data.
Competitive innovation
The overall drivers were partly commercial, with the growing strategic importance of business intelligence, and partly security concerns, which were to reach a climax of course after September 11, 2001. At the same time, the huge increase in translation from U.S. English for markets abroad led to an interest in special controlled authoring languages at Perkins, Kodak, Caterpillar, Boeing and others. Some of these tried to develop software solutions that ensured compliance with corporate or industry writing rules, simplified international readability (i.e. for people whose native language wasn't English) and rendered certain types of texts consistent enough to facilitate automatic translation.
A further vital stimulus to language technology innovation during the 1980s was the rise of the Japanese Fifth Generation program, a wide-ranging endeavor that aimed to make computers generally more intelligent. Natural language processing and ‘understanding’ were key to this ambitious initiative, which included research into such practical applications as spoken translation systems. The US and Europeans both reacted to this competitive threat by launching major new R&D initiatives in speech (having computers recognize and produce spoken language) and language engineering: the DARPA/NSF (USA) and Information Society Technologies (EU) programs. It was from these efforts that the first language and speech technology companies were spun off in the 1990s, launching the market for robust language software.
A multilingual Internet
The rise of the Internet, the massive use of email and the accelerating exchange of information over the Web naturally gave a further boost to the nascent language industry by suddenly positioning multilinguality as a consumer and not just a business experience. Global Web usage meant more translation, more document searching, and as a result more resources such as dictionaries and search engines to master the glut of online information. By giving more people the ability to publish and market their applications, the Web enabled literally dozens of online machine translation sites.
Some of these service sites are downsized versions of existing machine translation systems such as Systran or ProMT (which specializes in Russian and other European languages). Others are simply automated dictionary lookups for various pairs of languages, and others again are Open Source or homemade efforts. While few of these sites offer truly innovative technologies, and output quality is usually very low, their combined presence has had the effect of transforming the very idea of translating a web site or a document from a specialist (and therefore rare) skill into a rapid multilingual information search that anyone could use. The tone was set for language processing to become a "Click" functionality in the digital mindset.
During the mid to late 1990s, the language industry adapted to the new global commercial order of Websites and online ‘content’ by streamlining the translation process, developing multilingual content management systems, creating standards for exchanging translation memories and, in general, adopting the networked organization as a model for managing processes in a more time- and cost-efficient way. Just as translation memories, containing legacy translations of strategic documents, began to be perceived as a corporate asset for far-sighted companies, researchers were coming to realize that the existence of huge amounts of digital text in various languages constituted a vast mine of raw material for language processing experiments.
The automatic translation agenda therefore began to shift from the laboriously hand-crafted rule-based systems that had been developed from the earliest years of computing to rapid-fire data-driven approaches, where the machines themselves could exploit existing patterns in language to find plausible translation equivalents. The era of statistical translation backed by machine learning techniques had begun, first in the R&D labs and now in commercial products (such as Language Weaver and Linear B, and in a different register ESTeam) and services.
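To make the data-driven idea concrete, here is a minimal sketch of how a program might guess translation equivalents purely from patterns in already-translated text. It is only an illustration: the four sentence pairs are invented, the function names dice_table and best_equivalent are hypothetical, and the scoring method (a simple Dice co-occurrence score) is a textbook simplification rather than the method used by any of the products named above.

```python
from collections import Counter
from itertools import product

# Toy, invented English-French sentence pairs (illustrative only).
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]


def dice_table(pairs):
    """Score every (source word, target word) pair by how often the two
    words appear in aligned sentences, relative to their own frequencies."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Count each source/target word combination once per sentence pair.
        joint.update(product(s_words, t_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_equivalent(word, table):
    """Return the highest-scoring target word for `word`, or None if the
    word never occurred in the source side of the corpus."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None


if __name__ == "__main__":
    table = dice_table(PARALLEL)
    for w in ("house", "car", "the", "a"):
        print(w, "->", best_equivalent(w, table))
```

No grammar rule or dictionary entry is written by hand here; the pairing of "house" with "maison" emerges only because the two words keep turning up in the same aligned sentences. Real statistical systems rely on far larger corpora and much more elaborate models, but the underlying intuition is the same.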
The emergence of a language tech market
The language technology market today is best understood as a segment of the global software industry. In Europe, a recent survey made by the EU counted some 300 businesses involved more or less directly in developing and marketing language processing software of some sort. Most of these firms are very small start-ups, usually leveraging academic research and national development programs to create innovative software products. They are inevitably fragile and under-capitalized, and present a high risk profile to buyers seeking sustainable language support tools.
The UK, Germany, the Netherlands, France, Italy and Finland seem to be the most active countries in this sector. Spain tends to be fragmented into regional development centers, and the greater opportunity offered by Latin America as a potential market for NLP-driven Spanish language tools has not yet been fully grasped. Russia has traditionally been a hotbed of theoretical computational linguistics, and has recently managed to leverage some of its extensive earlier research work into practical applications. There are also research bases throughout Eastern, Central and Baltic Europe, but almost no competitive commercial applications yet.
Consolidation
Although theoretically ripe for consolidation, due to the existence of a large group of small companies competing for a small group of large customers, this market has not yet cultivated many truly multinational players. The French text mining firm TEMIS, which in 2003 acquired Xerox's natural language suite (parsers and term extraction tools) is one of the few truly European firms in this space. In speech, the Belgian firm Acapela was created by the merger of Babel Technologies, Elan Speech and Babel Infovox in 2004. Comprendium, a German-UK knowledge management firm that has changed hands more than once, has European scope, and includes a revamped translation technology division. But its original technology derives from a SAIL Labs version of an MT system first developed back in the 1980s. In the translation memory or globalization content management market, Trados is one of the few notable brands. Among interested customers, SAP is both a major user of language technology and a developer of proprietary translation automation solutions, essentially driven by its bilingual German-English documentation heritage. A number of European banks in Switzerland and Belgium have explored translation automation solutions, and French auto-maker Renault among others has implemented a translation automation solution for its global intranet.
The European Commission uses a version of Systran as a gisting engine integrated into a comprehensive EURAMIS workflow system. And over the past 15 years, many enterprises with multilingual documentation needs have experimented with translation automation and knowledge mining solutions that deploy genuine NLP techniques. However, long-term commitment to these solutions appears to have been very rare, and useful information on best practices has remained hard to come by.
Now that NLP technologies have reached sufficient maturity to form components of larger systems, the cycle of consolidation has begun. Partnerships and buy-outs are gradually embedding multilingual capabilities in the complete content value chain. One recent example is N-Stein's purchase of Alis Technologies, a Canadian multilingual content solutions player that has been providing translation automation services for over a decade. Another, in a different register, is IAC/InterActiveCorp's purchase of the "natural language" search engine AskJeeves.
Key processes where such tools can make a difference are business intelligence and knowledge management, where language empowerment is used to improve the accuracy and relevance of searches. Another major business area is that of ‘speech technology’ products, usually targeting the call center, mobile phone and audio mining markets, which are also starting to use language tools to improve performance. On the other hand, in a recent French survey, translation tools played a much smaller role (5%) in the total market place. While the global outsourceable translation market is worth around US$ 5-10 billion, a much smaller share of this - around US$ 100 million - is thought to be earned by translation automation suppliers.
In the U.S., there are fewer companies involved in the various segments of the language industry, and they are certainly more centered on English language processing. However, Inxight (originally a Xerox start-up) probably stands out as one of the key players, along with Basis Technologies, both of which are also multilingual in orientation. For many of these players, the emphasis is on providing natural-language intelligence tools (language identifiers, simple parsers to recognize entities and actions, lemmatizers to simplify word recognition, etc.) to upgrade business process performance in middleware products, CRM applications, email management and enterprise content management in general.
It would, however, be wrong to think of language technology as exclusively associated with a start-up culture. The software industry's legacy majors such as Microsoft, IBM and Oracle have all invested deeply in industry scale NLP applications, including multilingual searching and knowledge management, and translation automation, and they run some of the most advanced near-to-market R&D teams in the world. Their large scale database and middleware offerings will increasingly embed a variety of language technology features that will go mainstream across all business process software applications.
As semantics-driven multilingual searching and some form of translation gradually evolve into core tasks on the desktop and within the corporation, there is every likelihood that they will become embedded deep in the operating systems or networking applications that define work, personal computing, entertainment and gaming right around the world. This means that such large players are well positioned to control this much sought-after real estate.
Customers for these language technologies in all countries include large businesses in the life science, financial and high technology sectors with very large (hundreds of millions of searchable objects) text and knowledge mining or large document translation requirements, but also government and security services that require real-time search, extraction and often translation of very large streams of text and audio records. At the same time, the boom in mobile telephony and computer games is raising interest in various language technology applications to improve the interface experience (e.g. with speech) or to speed up information management through better searches or predictive typing.
The research agenda
The current research agenda naturally ranges widely over the language field, but there are a number of flagship programs designed to meet emerging needs in the communications and content management area as a whole, now that products and services try to address the needs of consumers and citizens in a more language-specific way.
In the EU, which is attempting to harmonize its very disparate national economies and eradicate inequalities, public R&D is focused on ensuring multilingual access to all kinds of content, spoken translation (in cooperation with the U.S. and Japan, and involving cooperative projects), adding rich semantics to all linguistic content, and inventing flexible multimedia interfaces in which language will be seamlessly interwoven with other media.
In the U.S. there is still interest in text translation as a key component in advanced information solutions, where cross-lingual and cross-media (text and audio and video) searching will possibly lead to a more unified approach to customizing knowledge management for all kinds of end users.
With such digital giants as Microsoft, Hewlett-Packard, Oracle and Nokia opening new R&D labs in India and China, Asian languages will gradually be factored into the overall multilingual equation. On the one hand this movement will empower local language speakers to demand and expect content in their own languages; on the other it will provide new markets for localized devices such as mobile phones and games boxes. In due course, complex multilingual regions such as South Africa and India will follow the route that the EU has already taken, hopefully more quickly, as they will benefit in part from the laboriously won results of the first 50 years of language technology developed in Europe and the U.S.

