Terminology work (5) – extraction

This series offers some insights from the many workshops and presentations on terminology that I have done over the years.

The terminology you are using appears in any written text, be it website pages, brochures, manuals, guidelines, contracts, reports…

Someone will have to read all that text, decide what is a term, and extract (copy/paste) it. Tools can help with this, but make no mistake: the tools are not intelligent. Most terminology extraction tools work on a statistical basis: the more often a term appears, the more important it is assumed to be. That is not always the case. An important term might come up only twice, once in the heading and once in the first paragraph, and afterwards be referred to by its short form. In this case, most statistical tools would not extract the term, as it appears fewer than 5 times.
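To make the frequency problem concrete, here is a minimal sketch of a purely statistical extraction. It is not how any particular tool works; the function name `frequency_candidates`, the sample text and the default threshold of 5 are assumptions chosen only to mirror the example above.

```python
import re
from collections import Counter

def frequency_candidates(text, min_count=5, max_words=3):
    """Return word sequences of 1..max_words words that occur at least min_count times.

    Hypothetical helper for illustration: punctuation is ignored, so sequences
    may run across sentence boundaries, which a real tool would avoid.
    """
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {term: count for term, count in counts.items() if count >= min_count}

# The full term appears once in the heading and once in the first sentence;
# afterwards only the short form "valve" is used.
sample = (
    "Pressure relief valve. "
    "The pressure relief valve protects the tank. "
    + "The valve opens when the tank is full. " * 5
)

candidates = frequency_candidates(sample)
print("valve" in candidates)                  # True: the short form is frequent enough
print("pressure relief valve" in candidates)  # False: the full term occurs only twice
```

Lowering `min_count` to 2 would catch the full term, but it would also let in many more word sequences that are not terms, so the list to be checked by hand grows accordingly.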

There are linguistic extraction tools, but they are limited to the language pair they were built for and are not available for all language pairs. They can at least be configured, for example, to extract noun phrases of up to 4 words, which are usually good candidates for a term list. Statistical tools will create a huge list of possible terms, which then needs to be checked for the real terms.
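A real linguistic tool relies on language-specific analysis such as part-of-speech tagging, which is beyond a short example. As a rough stand-in, the sketch below collects runs of up to 4 consecutive words that are not on a stop-word list. The stop-word list, the function name `phrase_candidates` and the sample sentence are assumptions made for illustration, and the approach only approximates noun phrases.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real linguistic tool
# would use part-of-speech tagging for the language instead.
STOP_WORDS = {
    "a", "an", "the", "of", "is", "are", "in", "on", "to",
    "and", "or", "by", "with", "when", "must", "be",
}

def phrase_candidates(text, max_words=4):
    """Collect runs of consecutive non-stop-words that are 1..max_words words long.

    Runs longer than max_words are dropped rather than split.
    """
    candidates = set()
    # Split at punctuation so that a phrase never spans two sentences or clauses.
    for sentence in re.split(r"[.!?;:,]", text.lower()):
        run = []
        # The trailing None marks the end of the sentence and flushes the last run.
        for word in re.findall(r"[a-z]+(?:-[a-z]+)*", sentence) + [None]:
            if word is None or word in STOP_WORDS:
                if 1 <= len(run) <= max_words:
                    candidates.add(" ".join(run))
                run = []
            else:
                run.append(word)
    return candidates

sample = "The pressure relief valve is mounted on the storage tank."
print(sorted(phrase_candidates(sample)))
# ['mounted', 'pressure relief valve', 'storage tank']
```

The multi-word terms are found even though each occurs only once, but the verb "mounted" slips through as well. Even with a length limit, the result is a candidate list that still has to be checked.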

In my experience (mostly extractions from English and German technical and medical documents), there is a threshold above which extraction with a tool makes more sense. I found that for up to 20,000 words of text, it does not really make a difference whether you read through the text sentence by sentence and select the terms manually, or run a statistical extraction tool and then go through the list and mark the terms you want to keep. Beyond that, extraction with a tool is faster.

Most translation tools have a component for extracting terms that can be used for both monolingual (usually source-language) material and bilingual material, i.e. translation memories, bilingual files from the translation process or alignments of files.

To estimate how much terminology can be extracted, I usually calculate with about 20% of the terms in a list extracted by a tool, or between 5 and 15% of the overall word count of the document(s), depending on whether they are more general or more technical in nature.
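These rules of thumb are simple enough to write down as a small calculation. The function names and the example figures below are made up for illustration; only the percentages come from the estimate above.

```python
def estimate_from_candidates(candidate_count, keep_percent=20):
    """Estimate the number of real terms in a tool-generated candidate list."""
    return candidate_count * keep_percent // 100

def estimate_from_word_count(word_count, low_percent=5, high_percent=15):
    """Estimate a (low, high) range of terms from the overall word count."""
    return (word_count * low_percent // 100, word_count * high_percent // 100)

# Example figures (assumed): a tool returns 3,000 candidates for 20,000 words of text.
print(estimate_from_candidates(3000))    # 600
print(estimate_from_word_count(20000))   # (1000, 3000)
```

Where a given set of documents falls within that range depends on how general or technical the material is, so the result is a planning figure rather than a prediction.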

When extracting terms, make sure you have defined what kind of terms you are looking for (see part 3 of this series: Terminology work (3) – fundamental decisions).



