Wednesday, May 10, 2006

Collocations and Machine Language Translations

The tendency for words to occur together is called collocation.

Mathematicians who study language and have access to large amounts of computing power are building English-language databases. These databases can be used for machine translation, formulas that rank collocations, priority lists of the most-used words, word-grouping tendencies and other linguistics research. In the simplest approach, a text is analyzed for collocations by counting the occurrences of a word and then counting each of the words that precede or accompany it.
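To make the counting idea concrete, here is a minimal Python sketch. It is only an illustration, not the method any particular research group uses. The function name count_pairs and the sample text are made up for this example, and it takes "accompanying words" to mean only the word immediately next to another one. It also ignores sentence boundaries, which a real analysis would not.

```python
import re
from collections import Counter


def count_pairs(text):
    """Count every word and every adjacent (word, next word) pair in a text."""
    # Lowercase the text and keep only runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    # Pair each word with the word that follows it (sentence breaks are ignored).
    pair_counts = Counter(zip(words, words[1:]))
    return word_counts, pair_counts


# Hypothetical sample text, invented for this illustration.
sample = (
    "Please comment on the draft. Do not comment about the title. "
    "I will comment on it later."
)
word_counts, pair_counts = count_pairs(sample)

print(word_counts["comment"])              # how often "comment" occurs
print(pair_counts[("comment", "on")])      # how often "on" follows it
print(pair_counts[("comment", "about")])   # how often "about" follows it
```

Dividing a pair count by the word count gives the share of uses in which the two words appear together.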

A simple formula approach makes collocations easy to understand. A unique collocation occurs with only one set of words in combination. An example of a unique collocation is "shrug shoulders": it does not appear in any other combination.

Strong collocations account for over 70 percent of a word's total uses. For example, "comment on" is considered a strong collocation at 75 percent of the combinations, while "comment about" is considered a weak collocation at less than 20 percent. Other examples of strong collocations: a vivid imagination, problem child, bosom buddy, dead serious.
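The percentage rule can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the function name collocation_strength is invented here, "unique" is taken to mean that every use of the word occurs in that one combination, and the label "unclassified" for shares between 20 and 70 percent is an assumption, since that middle range is not named above.

```python
def collocation_strength(pair_count, word_count):
    """Label a collocation by the share of a word's uses it accounts for."""
    if word_count <= 0 or pair_count < 0 or pair_count > word_count:
        raise ValueError("counts must satisfy 0 <= pair_count <= word_count, word_count > 0")
    if pair_count == word_count:
        return "unique"        # the word never appears in another combination
    share = pair_count / word_count
    if share > 0.70:
        return "strong"        # over 70 percent of total uses
    if share < 0.20:
        return "weak"          # under 20 percent of total uses
    return "unclassified"      # assumption: the middle range is not named in the post


print(collocation_strength(75, 100))   # "comment on" at 75 percent
print(collocation_strength(15, 100))   # a share below 20 percent
print(collocation_strength(40, 40))    # every use is this combination
```

The counts passed in are illustrative numbers, not measurements from a real corpus.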

Fixed phrases are sometimes considered extended collocations.
Examples: at the time of writing, it is interesting to note that, to be taken into account, rather you than me, not on your life, all's well that ends well, under the weather, not for love or money, as far as I'm concerned.

Collocations and Connotation

Not all co-occurring items count as collocations. Phrasal verbs, idioms and compound nouns are not considered collocations.

One factor behind differences in collocation meaning is connotation. Polysemy is the term for a word having more than one meaning. Sometimes a word collocates with one meaning of a polysemous word but not with its other meanings.

With some exceptions, collocations usually keep a more literal interpretation of the words used together. Phrasal verbs, idioms and compound nouns, by contrast, have meanings in context that differ totally, partially or variably from the literal interpretation of the actual words.

Example Collocation Combinations
adjective + noun (light drizzle)
adjective + preposition (big of)
adverb + adjective (very pretty)
adverb + verb (boldly go)
noun + noun (designer collection)
verb + noun (cover ground)
verb + preposition (apply to)

Altavista's Babelfish and Google, both powered by Systran, allow for so-called gist translation with an error rate of 20 to 30 percent. The large error rate is due to the way a word's meaning varies with context. One example: "The spirit is willing but the flesh is weak" was translated from English to Russian and back again, only to yield "The vodka is good but the meat is rotten." So far Babelfish has only 19 language pairs available, and it has taken decades to develop the language-pair rules for each of the 9,900 language word pairs.

Newer statistical machine translation systems eliminate the old rule-based, gap-filling solutions and depend entirely on a statistical solution that looks for overlaps between sentence fragments and for collocation tendencies.

A statistical machine translation system of this kind knows what it doesn't know. However, the core database may prove unwieldy, consisting of hundreds of gigabytes and demanding huge computing power.

One observation for language students and language teachers: the translation pool for just average translations is 9,900 words. The big variable is context, which means that a word can be used in various ways: formally, as industry-specific jargon, as slang, in idioms, or as a different part of speech performing a different function within that particular meaning. If every word has an average of five context variables, then the student really has to learn about 50,000 items.
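The arithmetic behind that estimate is simple enough to check in Python. The variable names are made up for this illustration, and the figure of five context variables per word is the rough average assumed above, not a measured value.

```python
base_words = 9_900        # size of the translation pool given in the post
contexts_per_word = 5     # assumed average number of context variables per word

items_to_learn = base_words * contexts_per_word
print(items_to_learn)     # 49500, which rounds to the 50,000 quoted above
```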

In conclusion, second-language learning takes time and effort, and there should be plenty of translation jobs for the next 20 years if you are willing to invest the seven to nine years it takes to become proficient.

Thanks to Sentence Master Grammar Text


