TAUS - Enabling better translation

Wednesday
Feb 22nd

The future is Corpus Linguistics


How do you translate ‘cloud computing’, ‘cell phone’ or ‘crowdsourcing’ into your language?

Many translators will use Google, Linguee, TAUS Search or a similar search tool, which sieves through the data to find the answers. There’s knowledge in the data, and we have oceans of it.

 

It’s time for the translation sector to embrace corpus linguistics, which studies language in its practical use, not as a theory. Corpus linguistics is empirical rather than normative. Translation in the 21st century requires a pragmatic and dynamic approach to solving language challenges.

On October 14, 2009, The Wall Street Journal reported on the struggle of France’s Commission of Terminology and Neology to decide on an official French equivalent for the new term ‘cloud computing’. It took the commission 18 months to come up with the term ‘informatique en nuage’, which was then rejected by the chairman.

This normative and regulatory approach to terminology seems so last century now. In our fast-moving global economy we can’t wait for committees. The world is communicating in real-time.

We are often both information consumers and publishers ourselves. Every speaker of a language has the right to change his or her language, invent new terms or give existing terms new meaning, and the point is: They do! As language professionals it is our task to track and understand the dynamics of language use. Hence our view on corpus linguistics.

What is corpus linguistics?

Corpus linguistics analyzes collections of text or documents. These collections, or corpora, may be very big (billions of words) or relatively small (a few hundred thousand words). The analysis usually starts with tokenization (identifying the individual words), stemming (reducing words to their root forms) and part-of-speech tagging (assigning each word its syntactic class) before useful data can be extracted from the corpus.
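The first two of these steps can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the suffix list and the sample sentence are assumptions made for this example, and the suffix stripping is a crude stand-in for a real stemmer (such as the Porter stemmer), which handles far more cases. Part-of-speech tagging is left out because it normally relies on a trained tagger.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (letters, optional apostrophe part)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def naive_stem(token):
    """Strip a couple of common English suffixes (illustrative only)."""
    for suffix in ("ing", "s"):
        if (
            token.endswith(suffix)
            and not token.endswith("ss")
            and len(token) - len(suffix) >= 3
        ):
            return token[: -len(suffix)]
    return token


def stem_frequencies(text):
    """Count how often each stemmed token occurs in the text."""
    return Counter(naive_stem(tok) for tok in tokenize(text))


# Hypothetical one-line "corpus" used only to show the idea.
corpus = "Translators translate documents. A translator translates a document."
freqs = stem_frequencies(corpus)
```

After stemming, singular and plural forms are counted together, so `freqs` records "translator" and "document" twice each. Frequency tables of this kind are the raw material for the later, more useful analyses.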

To enhance the usefulness of corpora further, intelligent processes can be applied, such as cleaning, categorizing and clustering the data. Corpus linguistics has been around for decades. It is the key to extremely valuable business applications such as text mining, knowledge management and search. Only relatively recently has it started to make inroads into the translation space, with statistical and hybrid machine translation applications and very large TM sharing platforms such as Translation Workspace, MyMemory, Google Translator Toolkit and TAUS Data Association.

The journey from project TM to shared data and corpus linguistics

Translation Memory (TM) tools in general continue to make a slow journey out of the last century. One by one they are adopting more sophisticated features, but most do not yet use the full benefits of corpus linguistics. The table below illustrates the differences.

 

| | 20th Century Translation | 21st Century Translation |
|---|---|---|
| Resources focus | TM, terminology | Data |
| Analysis focus | Document, project | Corpus |
| User focus | Translator | Enterprise |
| Market focus | Client, project | Domains, sector, client, product |
| Level of interaction | Highly manual | Mostly automated |
| Leveraging | Segment matching | Statistical and linguistic |
| Applications | Translation support; terminology support | Translation support; text mining; terminology harvesting; search engine optimization; many more … |
| Benefits | Productivity enhancement | Discover and organize knowledge; enhance productivity |
| Integration | GMS and TMS workflows | Content, search, knowledge management, enterprise applications |

While existing TM tools are slowly adding features such as sub-segment matching, corpus-based leveraging and term extraction, new translation tools are being researched and will be developed for the market in the coming years. These will be designed based on 21st century translation needs.

Translation needs then and now

TM software was developed as a productivity tool for the translator. It was originally designed to manage the translation of a single document or project. Globalization and Translation Management Systems (GMS and TMS) were added later as a workflow layer to scale up TM tools and make them fit for enterprise use.

Traditionally, TM tools use the sentence as the basis for full and fuzzy matches. As long as your translation practice concentrates on revisions and updates of documents and products, the leveraging score of TM tools can be very high.
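To make sentence-level fuzzy matching concrete, here is a minimal sketch using the `difflib` module from the Python standard library. It is not how any particular TM product computes its match scores: the word-level comparison, the 70 percent threshold, the function names and the two sample segments with their French translations are all assumptions made for this illustration.

```python
from difflib import SequenceMatcher


def fuzzy_score(a, b):
    """Return a 0-100 similarity score, comparing the two segments word by word."""
    matcher = SequenceMatcher(None, a.lower().split(), b.lower().split())
    return round(100 * matcher.ratio())


def best_match(segment, memory, threshold=70):
    """Return (score, source, target) for the best entry at or above threshold, else None."""
    best = None
    for source, target in memory.items():
        score = fuzzy_score(segment, source)
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Hypothetical translation memory: source segment -> stored translation.
tm = {
    "Press the power button to start the device.":
        "Appuyez sur le bouton d'alimentation pour démarrer l'appareil.",
    "Remove the battery before cleaning.":
        "Retirez la batterie avant le nettoyage.",
}

# A revised sentence that differs from a stored segment by one word.
match = best_match("Press the power button to restart the device.", tm)
```

A sentence that changes a single word still retrieves the stored translation as a high fuzzy match, which is why updates and revisions leverage so well. A sentence with no counterpart in the memory returns nothing, however much related text the memory contains.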

However, the market is changing. The increasing need for rapid-turnaround translation of smaller bits of content has brought developers back to the design table. Their focus is not only the productivity of the translator but also the agility of the enterprise.

The new tools rely on advanced statistical approaches, as already applied in statistical and hybrid MT systems, and they bring sophisticated linguistic intelligence into the mix as well.

They do not leverage from a single document or project but use as much domain-specific text as possible. This advanced leveraging capability can further increase translation productivity.

What’s more, they bring significant new value by helping the enterprise to discover and organize knowledge in multiple languages. Data-driven translation technology brings opportunities for new services and applications, such as terminology harvesting, automatic identification of synonyms and related terms, search engine optimization, sentiment analysis, quality evaluation, predictive translation and multilingual authoring.
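As a rough idea of what terminology harvesting involves at its simplest, the sketch below counts recurring two-word sequences across a handful of sentences and keeps those that contain no function words. Everything in it is an assumption for illustration: the tiny stopword list, the function name, the frequency cut-off and the sample sentences. Real term extraction typically adds part-of-speech filtering and statistical association measures on top of raw counts.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "for", "on", "with"}


def candidate_terms(texts, n=2, min_count=2):
    """Return (term, count) pairs for stopword-free word n-grams seen at least min_count times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tokens[i : i + n]
            if not any(tok in STOPWORDS for tok in gram):
                counts[" ".join(gram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


# Hypothetical mini-corpus used only to show the idea.
docs = [
    "Cloud computing is changing the translation industry.",
    "The translation industry needs terms for cloud computing.",
    "Crowdsourcing and cloud computing are new terms.",
]

terms = candidate_terms(docs)
```

On this sample, the recurring candidates are "cloud computing" and "translation industry". The same counting applied to a large domain-specific corpus, rather than to one document or project, is what surfaces the terms that speakers actually use.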

At the foundation of this emergence of new translation technology is our old friend corpus linguistics. As the market starts shifting to collaborative business models and shared resources and services, corpus linguistics will also prove valuable for the development of sector-specific cross-lingual solutions, for instance in healthcare, chemical, financial, government and legal markets.

Clearly, traditional TM tools will continue to prove their value for traditional translation work, but the benefits of the new technology will be too great to be ignored.


TAUS is researching corpus linguistics and language data management. An introductory report on Corpus Analysis and Language Data Management is scheduled for autumn 2011. We invite your comments and your experience with data-driven translation tools. We would also like to hear from you if you are developing new translation technology.

 

Comments  

 
#2 Anthony Shore 2011-08-17 19:10
It seems the present, not just the future, is corpus linguistics. Just recently, I wrote an article -- the first I know of in this subject area -- discussing how corpus linguistics can be applied to help develop creative brand names.

For those interested, here's a link:
http://operativewords.blogspot.com/2011/08/how-to-create-names-using-worlds-most.html

Enjoy!

- Anth
#1 Gabriel 2011-08-04 19:07
So, a lot of stuff to comment on here... Getting back to the anecdote about the French government, this clearly proves that linguistics and/or translation are not only a matter of technology, but also of ideology and, in this case, of language conservatism and conservation. (French could also have used the word "software" instead of "logiciel", as others do... But they don't. Why? I suppose because language-makers have resources enough to depict every nuance of reality, and their politicians have the strength to get people to stick to it.)
I have also observed that the table content is out of scope. For instance, from the very beginning of translation that reuses previous translations (back in the mid-1950s), the focus has been on business needs. The European Coal and Steel Community (ECSC) developed, against this background, approaches to reuse previously translated content, and they were ENTERPRISE-FOCUSED. I also see some gaps when it comes to establishing a difference between CAT and MT...