TAUS - Enabling better translation


TAUS Glossary


A


Advanced leveraging (AL)

The technique of gaining greater benefits from existing TMs by exploiting parallel data at the sub-sentential level (e.g. short phrases).


B


BLEU score (BLEU)

Bilingual Evaluation Understudy, an algorithm for evaluating MT output against a reference human translation. Best used to track improvements in an MT system over several cycles of training. BLEU is not a useful metric for MT end users trying to evaluate quality.
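
The scoring idea can be sketched in a few lines. This is a simplified, sentence-level illustration only: it uses a single reference, whitespace tokenization and n-grams up to length 2 (the `max_n` default is an assumption made here for brevity), whereas BLEU as normally reported is computed over a whole test corpus with n-grams up to length 4. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against one reference translation."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped counts: an n-gram is credited at most as often as it
        # occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Scores from a sketch like this are only meaningful relative to each other, for example when comparing the same engine before and after retraining on the same test sentences.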


C


Customization

Adapting an MT engine and tuning it to a specific enterprise or use case, using relevant data (often the customer’s own), appropriate terminology, and specific rules optimized for a given customer or occasion of use. Often associated with RBMT applications.


D


Defense Advanced Research Projects Agency (DARPA)

The US Department of Defense agency that has regularly launched R&D projects in the field of text and speech-to-speech translation in recent years. Its TIDES program (Translingual Information Detection, Extraction and Summarization) has been working on MT for intelligence for the past decade.

Data cleaning

Removing unwanted tags and other items in parallel corpora (TMs) or terminology lists to improve quality for SMT processing.
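
As a rough illustration of the kind of cleaning meant here, the sketch below strips anything that looks like an inline tag from each side of a segment pair and drops pairs that end up empty. The regular expression and the function names are assumptions for this example; real TMs use a variety of tag formats that need format-specific handling.

```python
import re

# Assumption: inline tags look like <...>; other tag syntaxes are not covered.
TAG_RE = re.compile(r"<[^>]+>")
SPACE_RE = re.compile(r"\s+")


def clean_segment(text):
    """Remove inline tags, then collapse runs of whitespace."""
    without_tags = TAG_RE.sub("", text)
    return SPACE_RE.sub(" ", without_tags).strip()


def clean_pairs(pairs):
    """Clean (source, target) pairs and drop any pair with an empty side."""
    cleaned = []
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if source and target:
            cleaned.append((source, target))
    return cleaned
```

Note that deleting a tag outright can join two words that the tag separated (for example a line-break tag with no surrounding spaces), so a production cleaner would treat such tags differently.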

Decoder

An SMT algorithm that searches the space of candidate target sentences for the one that has the highest probability as a translation for a given source sentence.

Domain, in-domain

A domain is an acknowledged universe of discourse, associated with an industry, company or product, exhibiting specific terminology and other linguistic features. In-domain terminology or language data covers terminology or data belonging to that industry, company or product.


E


Engine

An individual component of an MT system. A system could have several engines, e.g. covering different language pairs, or dedicated to specific domains.

Example based MT (EBMT)

Knowledge is acquired from a bilingual text using basic statistics (similar to learning by analogy). In many ways it is an early form of SMT.


G


General Text Matcher (GTM)

A software package that measures the similarity between texts by matching their components, e.g. those of a text and its translation. GTM can be used to help evaluate MT, by checking whether all elements in the source are represented in the target.


H


Hybrid MT (HMT)

An MT system that combines both rule based and statistical processes. Also more generally describes any MT system that uses TMs and other data sources in the workflow.


L


Language/Localization Business Innovation (LBI)

The drive to improve and extend translation automation processes in the localization and language industry as a whole.

Language model

A computational model of the typical structures of a given natural language. It is based on examples drawn from extensive data on a source or target language in translation.
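
One very small instance of such a model is a bigram model, which estimates how likely a word is given the word before it by counting examples in monolingual text. The sketch below is illustrative only: the function names are hypothetical, the estimates are plain relative frequencies with no smoothing, and production language models are trained on far more data with longer contexts.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        # <s> and </s> mark the sentence start and end.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])           # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return histories, bigrams


def bigram_probability(model, previous, word):
    """Relative-frequency estimate of P(word | previous); 0.0 if unseen."""
    histories, bigrams = model
    if histories[previous] == 0:
        return 0.0
    return bigrams[(previous, word)] / histories[previous]
```

For example, after training on "the cat sat" and "the dog sat", the estimate for "cat" following "the" is 0.5, because "the" was followed by "cat" in one of its two occurrences.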

Language pair

Any two languages used in a translation context (source and target).

Language Service Provider (LSP)

A company or organization that provides dedicated translation/language services to the industry or community.


M


METEOR

A software program that automatically evaluates the output of machine translation engines by comparing it to one or more reference translations. An improvement on BLEU.

Monolingual data

Language resources in a single language whose style and terms can feed into the MT output.

Moses

An open source SMT system developed and maintained by the Moses community and now available to potential end users everywhere.


N


National Institute of Standards and Technology (NIST)

A US federal agency. Its Open Machine Translation (OpenMT) program runs cycles of MT evaluation on various language pairs in which engines can compete.

Normalization

Cleaning TMs so they are better able to train an SMT workflow. Includes checking and removing unnecessary inline tags, irrelevant bits of data, mistranslations of homonyms, acronyms spelled out in target versions, one-into-two sentence mismatches, punctuation inconsistencies and upper/lower case mismatches.


P


Phrase Table

In SMT, a large set of n-gram (word or phrase) pairs over the source and target languages, together with their translation probabilities. Can grow to millions of items for a given translation job.
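
The shape of a phrase table can be pictured as a mapping from each source phrase to its candidate target phrases and their probabilities. The entries and probabilities below are invented purely for illustration, and `best_translation` is a hypothetical helper; a real decoder combines these probabilities with language model scores rather than simply taking the top entry.

```python
# Toy phrase table: source phrase -> list of (target phrase, probability).
# All values are made up for this example.
PHRASE_TABLE = {
    "good morning": [("bonjour", 0.8), ("bon matin", 0.2)],
    "thank you": [("merci", 0.9), ("je vous remercie", 0.1)],
}


def best_translation(table, source_phrase):
    """Return the most probable target phrase, or None if the phrase is unknown."""
    candidates = table.get(source_phrase, [])
    if not candidates:
        return None
    target, _probability = max(candidates, key=lambda pair: pair[1])
    return target
```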

Postediting, posteditor (PE)

Rapidly repairing MT output to align it with the end user’s expected quality levels. Usually carried out by translators specially trained to make rapid decisions on repair strategy per segment.

Preprocessing

A variety of operations on the source text to optimize it for MT. Usually involves ‘cleaning’ formatting errors and running regular expression checks to make a source text as machine-translatable as possible.


R


Rule based machine translation (RBMT)

An MT engine built on algorithms that analyze the syntax of the source and use rules to transfer the meaning to the target language by building a sentence. Contrast this with the processes of data searching and selecting on the basis of probabilities in SMT.


S


SAE J2450

A formal translation quality metric defined by the automotive industry that focuses on the following criteria for evaluation: Incorrect Term, Syntactic Error, Omission, Word Structure or Agreement, Misspelling, Punctuation, Miscellaneous Error.

Source language (SL)

The language from which a translation is made.

Source text (ST)

The document in the source language.

Statistical machine translation (SMT)

An MT system that uses algorithms to establish probabilities between segments in a source and target language document to propose translation candidates. Also known as ‘data-driven’ MT to contrast the approach with an RBMT system.

Statistical postediting (SPE)

An automated process whereby postedited output can be re-used as training data for an SMT system to improve quality in the next cycle and reduce the subsequent postediting load. Can also be used in a hybrid workflow.


T


Target language (TL)

The language into which a translation is made.

TAUS Data Association (TDA)

The first not-for-profit translation data repository (or cloud) intended to provide the translation industry with very large scale (multi-billions of words) shareable and curated language resources.

TAUS Search

An online string search tool enabling anyone to search the TDA cloud for parallel sets of strings.

TBX

An ISO-approved open, XML-based standard for exchanging structured terminological data.

Total Leveraging (TL)

Term proposed by KCSL for using exclusively in-domain TMs to drive automation.

Training Data

The set of sentences selected during the process of setting up an SMT workflow and used to train/customize the engine to the domain/languages in question.

Translation Error Rate (Plus) (TER(p))

An automatic metric for measuring the number of edit operations needed to transform MT output into a human translated reference. Used to assess the postediting load. TERp is a TER extension that automatically generates paraphrases and synonyms, stems words, and provides other powerful improvements.
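
The core of the measurement can be approximated as a word-level edit distance divided by the length of the reference. The sketch below counts only insertions, deletions and substitutions; it deliberately leaves out the block shifts that full TER also counts, and the paraphrase, synonym and stemming matching of TERp, so it should be read as a simplified approximation. The function names are hypothetical.

```python
def word_edit_distance(hypothesis_tokens, reference_tokens):
    """Minimum number of word insertions, deletions and substitutions."""
    # previous[j] = distance between the hypothesis prefix seen so far
    # and the first j reference words.
    previous = list(range(len(reference_tokens) + 1))
    for i, hyp_word in enumerate(hypothesis_tokens, start=1):
        current = [i]
        for j, ref_word in enumerate(reference_tokens, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,                      # delete a hypothesis word
                current[j - 1] + 1,                   # insert a reference word
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def simple_ter(hypothesis, reference):
    """Edits per reference word; 0.0 means the MT output needs no changes."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)
```

A lower value suggests less postediting effort: output that is one inserted word away from a six-word reference scores 1/6.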

Translation Memory (TM)

A database that stores previously translated sentences that can be reused on a sentence by sentence basis. The database matches source to target language pairs.
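
Sentence-level reuse can be sketched as a lookup that first tries an exact match and then falls back to the most similar stored source sentence. The dictionary-based store, the `lookup` function and the 0.75 threshold are assumptions for this example, and the character-based similarity from Python's standard `difflib` module merely stands in for the proprietary fuzzy-match scoring of commercial TM tools.

```python
import difflib


def lookup(memory, source, threshold=0.75):
    """Return (target, score) for the best stored match, or None below threshold.

    `memory` maps stored source sentences to their translations.
    """
    if source in memory:
        return memory[source], 1.0  # exact match
    best_target, best_score = None, 0.0
    for stored_source, target in memory.items():
        # ratio() is a character-level similarity between 0.0 and 1.0.
        score = difflib.SequenceMatcher(None, source, stored_source).ratio()
        if score > best_score:
            best_target, best_score = target, score
    if best_score >= threshold:
        return best_target, best_score
    return None
```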

Translation Memory Exchange (TMX)

A vendor-neutral open XML standard to simplify the conversion of TMs between formats.
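
For orientation, a TMX file stores each translation unit as a `tu` element containing one `tuv` element per language, with the text in a `seg` element. The sketch below reads such a file with Python's standard library; the sample document and the `read_tmx` function are made up for this example, and any inline markup inside a segment is flattened to plain text.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Minimal illustrative TMX document (contents invented for this example).
SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Enregistrez le fichier.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(text):
    """Return one {language: segment text} dict per translation unit."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text nested inside inline elements.
            unit[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        units.append(unit)
    return units
```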


X


XLIFF

The XML Localization Interchange File Format that can be understood by any localization provider.

 
