Language technology consultant Tom Hoar has recently made his corpus processing software available on SourceForge. With luck, this is the first of many new offerings in the statistical space that bring down the overhead of statistical machine translation (SMT).
How does corpusprocessor help SMT users?
The three hardest aspects of SMT deployment are preparing the training data, preparing the source document to match the training data used to build the SMT model, and creating tuning sets that match the expected source documents. My application is an assembly line for preparing training data for SMT. Typically, SMT training data extracted from translation memories might involve millions of sentence pairs in tens of thousands of different files. Usually people have to use desktop software tools to process this material manually through standard scripts, and watch over the job for hours in case things get lost or corrupted. I try to make this job simpler. Corpusprocessor basically offers a framework for aligning words, phrases or sentences, and strategies to clean and normalize them for the task at hand.
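To make the "clean and normalize" step concrete, here is a minimal Python sketch of the kind of filtering commonly applied to sentence pairs before SMT training. It is an illustration only and not corpusprocessor's actual code or API: the function names `normalize` and `clean_pairs`, the thresholds `max_tokens` and `max_ratio`, and the sample data are all assumptions made for this example.

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs, max_tokens=100, max_ratio=3.0):
    """Yield normalized (source, target) pairs that pass simple filters.

    Hypothetical filters: drop empty segments, overly long segments,
    pairs with very different lengths, and exact duplicates.
    Token counts use whitespace splitting, which is an assumption that
    does not suit languages written without spaces between words.
    """
    seen = set()
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # one side is empty
        n_src, n_tgt = len(source.split()), len(target.split())
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # too long to be useful for training
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths differ too much, likely misaligned
        if (source, target) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((source, target))
        yield source, target


# Illustrative sample data, not taken from any real translation memory.
raw = [
    ("Hello   world", "Bonjour le monde"),
    ("Hello world", "Bonjour le monde"),
    ("", "Texte sans source"),
    ("OK", "Ceci est une traduction beaucoup trop longue"),
]
cleaned = list(clean_pairs(raw))
```

With this sample, only the first pair survives: the second is a duplicate once whitespace is normalized, the third has an empty source, and the fourth fails the length-ratio check. A real pipeline would run such filters file by file over the whole corpus and keep a record of what was dropped.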
Why is corpus development so hard?
When you export TMs from systems designed for machine-assisted translation, they tend to include data associated with a specific stylistic rule for a specific customer, yet these data and rules are irrelevant to the training process. My application lets you largely automate this key process by providing a hierarchy and structure for the data as it goes through the pipeline. Developers can define the state of the raw files and then draw a map of where they go.
What is the advantage of "open sourcing" this development?
I want to build a community of users who will contribute their experience and knowledge to make a useful tool for the entire industry. As an independent consultant, I don't have the appropriate infrastructure to support this package commercially. A community can support and rapidly expand the functionality of an open source project beyond its natural limits. For example, it only handles two languages in parallel at present, but ideally it should be able to handle a whole cohort of parallel language versions and process them simultaneously. At some point, people start using these open source tools and then contribute to them. Obviously it requires a certain level of IT expertise, but I am already seeing a certain amount of interest in corpusprocessor, so it appears to be targeting an acknowledged pain point. I would like to create demand for professional services around this complex SMT requirement.
How did you come to work in this field?
I have worked in various areas of IT for almost 25 years, particularly in speech and language processing, with speech applications for call centers and translation services, centered most recently on the Thai language. This corpus processor solution grew out of a need I identified during a large-scale MT project in Asia.
A former Technical Operations Officer with the CIA, Tom Hoar has also worked on language- and technology-related projects for Nuance Communications, the Center for Speech and Language Processing at Chulalongkorn University, Graphic Vision and AsiaOnline.


