Why am I losing context matches when I move my TM from one tool to another using TMX?

Current TM systems often save not only the segment you translated, but also some context for that segment. The context is used to improve matching. This is why you might see matches called CM (context match), 101%, 102%, ICE (in-context exact) or similar. They show that the segment you are working on right now is not only the same as in the TM, but that its context is the same as well.

Unfortunately, the way this context information is saved to the TM and what exactly is saved as context is not standardized.

This means that tool A will not be able to read, interpret and use the context information from tool B. You will only receive a 100% match instead of a context match.

Here are examples of how some tools save their context:

The sample text was translated into the TMs and the TMs were then exported to TMX (Translation Memory eXchange format).

Sample text:

This is sentence one.

This is sentence two.

This is sentence three.

What is in the TMs:

Context for the second sentence in the TMX file from memoQ

The source segments before and after the actual segment are saved as context.

Context for the second sentence in the TMX file from SDL Trados Studio 2017

A hash code with information about the previous segment and the structure of the current segment (heading, footnote, content of a cell…) is saved as context.

Context for the second sentence in the TMX file from SDL Trados Studio 2019

A hash code plus the explicit text of the source and target segments before the actual segment is saved as context.

Also, the place where the context information is saved can differ (within the tuv area (translation unit variant = language) or before it), and the names of the attributes of the prop element differ as well (x-Context versus x-context-pre and x-context-post).
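
As a rough illustration of why this breaks down between tools, here is a minimal Python sketch. The translation unit below is a simplified, memoQ-style example built for this post (invented content, not a complete TMX file). A tool that looks only for a prop named x-Context, Trados-style, simply finds nothing:

```python
import xml.etree.ElementTree as ET

# Simplified translation unit modeled on a memoQ-style export
# (illustrative only, not a complete TMX file)
tu_xml = """
<tu>
  <prop type="x-context-pre">This is sentence one.</prop>
  <prop type="x-context-post">This is sentence three.</prop>
  <tuv xml:lang="en"><seg>This is sentence two.</seg></tuv>
</tu>
"""

tu = ET.fromstring(tu_xml)
context = {p.get("type"): p.text for p in tu.findall("prop")}

print(context.get("x-context-pre"))   # This is sentence one.
print(context.get("x-Context"))       # None - a lookup by the other tool's name comes up empty
```

The context text itself survives the TMX round trip; it is only stored under a prop name (and in a place) the other tool does not know to look for.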

This shows why it is not possible to re-use context information between tools.

Why do match values differ?

After talking about the things that can produce different word counts, we should also look at what can be the reason for different match values.

Even with the same file and the same TM, the analysis results can differ, because the settings that influence the match values are usually project-based settings.

Let us take penalties first. A penalty can be applied to matches that come from a specific TM, that carry metadata different from the one used in your current project, or maybe even to segments saved with a certain user name or user role. This means that instead of the “real” match value the segment would have, it shows up with a lower match value.

There are many reasons to apply a penalty:

  • The TM has been provided by a client and was not created by yourself, so you cannot guarantee its quality.
  • The material in the TM is old or comes from an alignment (most tools will apply a penalty to alignment segments automatically).
  • You have decided to start a new, fresh TM and use the existing TM only as a reference in the background.
  • The content was saved to the TM by a certain person (maybe by an intern who did an alignment and was not very careful while aligning the segments) or with a certain role (you want to trust segments confirmed by a reviewer more than those confirmed by a translator).
  • The content was translated for a different subject matter area and this information was saved to the TM as metadata (the TM contains translations from marketing, but you now want to translate a contract).
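
The arithmetic behind a penalty is simple: percentage points are subtracted from the match value the segment would otherwise have. A minimal sketch (the penalty values here are invented for illustration; each tool has its own defaults and limits):

```python
def effective_match(raw_value, penalties):
    """Subtract all applicable penalty points from the raw match value."""
    return max(0, raw_value - sum(penalties))

# a 100% match from an aligned TM with a 1% alignment penalty shows as 99%
print(effective_match(100, [1]))     # 99

# an 85% fuzzy match with a client-TM penalty (2%) and a user penalty (5%) shows as 78%
print(effective_match(85, [2, 5]))   # 78
```

This is also why a penalized 100% match no longer pretranslates or confirms automatically in most tools: it arrives below the usual 100% threshold.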

Then, there are filter settings. Usually, applying a filter means applying a penalty. But it could also be that certain segments do not appear at all, because the filter does not permit segments with different metadata, from a TM with a specific name or from a specific user.

Still another reason could be that the segmentation rules don’t cover all abbreviations. This will result in 2 segments in the document where there might be just one segment in the TM (maybe the translator joined the segments during translation, creating one segment in the TM but not updating the segmentation rules).

And another reason could be the use of different TM tools. As the way match values are calculated differs from tool to tool, an 82% match in tool 1 can very well be an 80% match in tool 2 and an 85% match in tool 3. The match values can actually differ quite a lot, depending on what the segments contain in the way of tags etc.

Here are some examples of differing match values:

Tool 1 shows 70%, tool 2 shows 89%.

The difference is one full word (short -> nice).

Tool 1 shows 95%, tool 2 shows 92%.

The differences are the number, the formatting, the capitalization and the spacing.

And to make it even more complex, the examples show that it is not necessarily the case that one tool always shows lower match values than the other 🙂

Word Count Differences (2)

How can word counts differ within the same tool on different machines?

Have you ever run a word count with the same document on two different machines and received different word counts?

Well, here is what can have an impact on the word count statistics:

  • The use of a TM on one machine and no TM on the other machine can produce different word counts. A project with no TM will use default settings for counting, which might have been adjusted in the TM you actually use. For example, the setting to count words with hyphens as one or two words.

Example: The same file in the same project gets analyzed without a TM and with a TM where the default settings had been adjusted (and here even the number of segments and characters changes).

Screenshots: analysis results without TM vs. with TM.

  • The filters you use to import the file have different settings. If a filter includes or excludes hidden text, hidden layers, comments, hidden rows or columns, embedded objects etc., this can have a big impact on the number of words that are counted. I remember one Word document that visibly had only a few words, but produced a very large word count because the content of an embedded Excel file was extracted on one machine, but not on the other.

Example: The same file (just with different names) was imported with the default XML filter and with a filter that also imports the content of an attribute for translation.

  • The use of different versions of the software. Believe it or not, the tool providers do tweak the way words are counted now and then. At one point there was a Trados version where a number-measurement combination was counted as two words in one version, but counted as one word in the next version. It took some time to figure that one out, believe me, as it was unfortunately not mentioned in the release notes.
  • The analysis settings you use. The analysis might have an option to ignore locked segments. If that is switched off on one machine, but switched on on the other, the word counts will differ as well (provided of course there are locked segments in the files, for example in XLIFF files from another tool, or if you run an analysis after file preparation).

Word Count Differences

How can it be that the word count for the same file differs from (translation) tool to (translation) tool?

The way a translation tool counts words can differ from any other translation tool as well as from the word count you can do in Word. The reason is the way words and word boundaries are defined in the tools. Some specify that a word with a hyphen (like “tool-related”) should be counted as one word, others see it as two words. The same is true for other delimiting characters, like slashes (/) or apostrophes (‘). It can even happen that a character like a slash, if it is surrounded by spaces (like in “in / out”), is counted as a word of its own in one tool, but not at all in another.

Some tools recognize combinations of letters and numbers (alphanumeric items) as one word, but only as long as there is no slash or hyphen that separates numbers from letters (ABC123 = 1 word, but ABC-123 = 2 words).
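
A toy sketch of such a setting in Python (the delimiter choice is simplified for illustration; real tools use far more elaborate tokenization rules):

```python
import re

def count_words(text, hyphen_is_delimiter=False):
    # split on whitespace, optionally also on hyphens
    pattern = r"[\s\-]+" if hyphen_is_delimiter else r"\s+"
    return len([t for t in re.split(pattern, text) if t])

print(count_words("ABC-123"))                            # 1
print(count_words("ABC-123", hyphen_is_delimiter=True))  # 2
```

Flipping one such switch changes every hyphenated item in the document, which is why two machines with different settings never agree on the total.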

Depending on the types of elements your file contains, the difference can be quite extensive. A recent example from a file preparation showed elements like these:

/content/legal/privacy?cid=cookieprivacy#cookies-policy

One tool counted that whole expression as 1 word, the other counted 4 words, using the slashes and the equal symbol as word delimiters. Imagine the word count difference if there are 1000 items like this one in the file.
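
The difference can be reproduced with two toy tokenizers (the delimiter sets are assumptions chosen to mimic the two tools, not their actual rules):

```python
import re

text = "/content/legal/privacy?cid=cookieprivacy#cookies-policy"

# tool 1 style: only whitespace delimits words -> the whole expression is one word
count_tool1 = len(text.split())

# tool 2 style: slashes and equals signs also delimit words
count_tool2 = len([t for t in re.split(r"[/=\s]+", text) if t])

print(count_tool1, count_tool2)  # 1 4
```

With the second rule the tokens are content, legal, privacy?cid and cookieprivacy#cookies-policy, hence the count of 4.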

Of course it is debatable which parts of the above expression need to be translated, if any. That would be a nice exercise for the use of regular expressions, either to tag the whole thing or to extract the translatable part. 🙂

And although some tools let you influence the way they count by providing checkboxes to specify whether words with hyphens count as one or two words, it is almost impossible to achieve the exact same word count with any two tools when your documents contain delimiting characters like slashes or equal symbols.

And here is the real-life comparison over 39 files:

Analysis tool A

Analysis tool B

Note that the segment count is quite close but the word count is very different.

REGEX – the hidden language for our translation tools

Our tools offer a lot of functionality, but in many places knowing some simple regex (regular expressions) can enhance these functionalities a lot.

  • You can create your own filters, i.e. determine what parts of a (text-based) document get imported for translation.
  • You can convert text elements into tags, which is especially useful for placeholders like {1}, ##NAME## or %sd.
  • You can use regex to search for a pattern, like a web address, a date, a combination of number and measurement…
  • You can use it to run a replace action (changing date formats or the sequence of elements, like 25% -> % 25).
  • You can use regex in the QA checkers to find specific things, like numbers and measurement units that are not separated by a non-breaking space.
  • You can use regex when specifying segmentation rules and segmentation exceptions.
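
A few of the list items above, sketched with Python's re module (the patterns are examples made up for this post, not the ones any particular tool uses):

```python
import re

text = "Hello {1}, your discount is 25% - see ##NAME## for details."

# find placeholders so they can be converted to tags
placeholders = re.findall(r"\{\d+\}|##[A-Z]+##", text)
print(placeholders)  # ['{1}', '##NAME##']

# replace action: change the sequence of number and percent sign (25% -> % 25)
print(re.sub(r"(\d+)%", r"% \1", text))
# Hello {1}, your discount is % 25 - see ##NAME## for details.
```

The same patterns can be pasted into a tool's tagger, QA checker or find-and-replace dialog, as most of them use a very similar regex flavor.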

We all know that good preparation at the beginning of a project can save a lot of repair work (in all the target languages), and regex is definitely a good thing to include in your preparation considerations.

 

I often get asked where to find material for learning regex. Well, there are a lot of very good tutorials on the internet (just use the search words “regex” and “tutorial”). But none of them focuses on the needs of the translation industry (hence my course on Regex for Translation on L10Ntrain).

 

From my experience, the regex you need to know starts with these few expressions:

  • Brackets and what they do: ( ) for grouping, [ ] for character ranges/lists and { } for minimum and maximum numbers of repetitions, like {2,4}.
  • Characters that have their own meaning in regex and need the backslash (escape character) before them, when you need to search for the actual character:
    • Dot (.), plus (+), asterisk (*), dollar ($), circumflex (^), backslash (\)
  • Searching for spaces in general: \s
  • Searching for digits: \d or [0-9]
  • Searching for letters: [a-z], [A-Z], \p{Lu}…
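
Put together, these few building blocks already make useful patterns. For example, a simple search for German-style dates (the pattern is invented for illustration):

```python
import re

# \d{1,2} = one or two digits, \. = a literal (escaped) dot, \b = word boundary
date_pattern = re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b")

print(date_pattern.findall("Delivered on 25.03.2024, invoiced on 1.4.2024."))
# ['25.03.2024', '1.4.2024']
```

Note how the dot has to be escaped with a backslash here, exactly as described in the list above; an unescaped dot would also match "25a03b2024".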

 

Of course, there are many more and you can do wonderful things with regex, but these few can get you started quite quickly.

For an overview and more examples, check out the introductory course on Regular Expressions in Translation 🙂

How would you calculate a repetition in the proposal/invoice?

This question comes up right after explaining what a repetition really is and it is not easily answered.

Technically, once the first occurrence of a repeated segment has been translated and saved to the TM, all other occurrences of this segment will appear as 100% matches to the translator. So, they could be invoiced the same way as 100% matches.

BUT, depending on the type of text you have to translate or the language pair you deal with, this might not be the case. For example, a catalog that is translated into German might deal with “gearboxes”, where in German the singular and plural of this word are identical (Getriebe). This means not all occurrences of this segment (maybe the headings in a table) have to be translated in the same way. Or, taking German as a target language again, one and the same sentence in English can have 3 different translations in German, depending on the gender of the object you are talking about. Example: Connect the one with the other. (Admittedly, that is not good style in English, but it happens 🙂 ).

 

This also brings up the question whether a 100% match can be left unchecked (as many clients seem to think). It is a 100% match after all, so it was there before, it is in the TM, it has been translated and paid for already… The thing is, a 100% match only tells you that the SOURCE segment has appeared before. It does not mean that the translation is complete, correct or fits the context. That is why translation vendors will usually tell you that even 100% matches should be checked for correctness in context.

Localization Tools – What is a Repetition?

Translation tools are easy enough to get started with, but there are many settings and features that are not so self-explanatory. One of these seems to be the definition for a repetition when doing an analysis of translation documents. When I ask what a repetition is, I get answers ranging from “when words are repeated” to “all sentences that are similar” to “segments that repeat” – where only the last one is partially true.

Here is the definition of a repetition from the tools I have dealt with so far: A repetition is a segment that comes up repeatedly (either inside a document or between documents), which DOES NOT have a 100% match from the TM. This last part is important. If it had a 100% match, it would be counted as such. So if a segment appears 5 times in exactly the same way AND has a 100% match from the TM, it would be counted as five 100% matches (usually, unless there is a setting to change this type of counting, but that is going too far here 🙂 ).

If it does not have a 100% match, it is counted as a “no match” or “fuzzy match” the first time around. The second occurrence is counted as the first repetition, and so forth.
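
The counting logic described above can be sketched like this (a simplification for exact matches only; real analyses also handle fuzzy bands, cross-file repetitions and tool-specific settings):

```python
def classify_segments(segments, tm):
    """Classify each segment the way a typical analysis would."""
    seen = set()
    labels = []
    for seg in segments:
        if seg in tm:
            labels.append("100% match")   # a TM match always wins over repetition
        elif seg in seen:
            labels.append("repetition")   # repeated, but no TM match
        else:
            seen.add(seg)
            labels.append("no match")     # first occurrence of a new segment
    return labels

print(classify_segments(["A", "A", "B", "B", "B"], tm={"A"}))
# ['100% match', '100% match', 'no match', 'repetition', 'repetition']
```

Note how segment B, which appears three times, yields one "no match" and only two repetitions, while segment A never counts as a repetition at all because the TM already contains it.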

Terminology work (5) – extraction

This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


The terminology you are using appears in any written text, be it website pages, brochures, manuals, guidelines, contracts, reports…

Someone will have to read all that text, decide what is a term and extract (copy/paste) it. Tools can help with this, but make no mistake, the tools are not intelligent. Most terminology extraction tools work on a statistical basis: the more often a term appears, the more important it is considered to be, which is not always the case. An important term might come up only twice, once in the heading and once in the first paragraph, and afterwards it might be referred to by its short form. In this case, most statistical tools would not extract the term, as it appears fewer than 5 times.
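
In essence, a statistical extractor is just a frequency count with a threshold, as this minimal sketch shows (real tools add stop-word lists, multi-word candidates and more):

```python
from collections import Counter
import re

def candidate_terms(text, min_freq=5):
    # purely statistical: anything frequent enough becomes a candidate
    words = re.findall(r"[a-z][a-z-]+", text.lower())
    return sorted(w for w, n in Counter(words).items() if n >= min_freq)

sample = "The gearbox housing protects the gearbox. " * 3
print(candidate_terms(sample, min_freq=5))  # ['gearbox', 'the']
```

Note how "the" shows up as a candidate because no stop-word list filters it out, while "housing", appearing only three times, is missed entirely. This is exactly why the resulting list needs human checking.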

There are linguistic extraction tools, but they are limited to the language pair they were built for and are not available for all language pairs. They can at least be configured, for example, to extract noun phrases of up to 4 words, which are usually good candidates for a term list. Statistical tools will create a huge list of possible terms, but then this list needs to be checked for the real terms.

From my experience (mostly extractions from English and German technical and medical documents) there is a threshold above which the extraction with a tool makes more sense. I found that up to 20,000 words of text, it does not really make a difference whether you read through the text sentence by sentence and select the terms manually, or run a statistical extraction tool and then go through the list and mark the terms you want to keep. Above that, the extraction with a tool is faster.

Most translation tools will have a component that allows the extraction of terms and can be used both for monolingual (usually source language) material and also for bilingual material, i.e. translation memories, bilingual files from the translation process or alignments of files.

To estimate how much terminology can be extracted, I usually calculate with about 20% of the terms on a list extracted by a tool, or between 5 and 15% of the overall word count of the document(s), depending on whether they are more general or more technical in nature.

When extracting terms, make sure you have defined what kind of terms you are looking for (see part 3 of this series: Terminology work (3) – fundamental decisions).

 

Angelika

(Trainer for translation tools since 1997)

Terminology work (4) – fundamental decisions about the user

This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


Most terminology work starts life in Excel – which is a very good way to get started, but not something you would use for professional terminology management.

Usually, when you start to think of terminology work, you already have a goal in mind or a pain point that needs your attention.

  • Recurring questions from translators – you want to provide them with a term list or term base that can be used in the translation tool (for terminology recognition and terminology checking)
  • Support tickets because users misunderstand the product or process description
  • Company-internal effort to check translated documentation for the correct terminology
  • You want to provide the company terminology to all users in the company through the intranet
  • You want to provide terminology lists to the authors

Depending on the intended user group, the information associated with each term can be different. Whereas a translator needs to know the term, the translation, any forbidden alternatives and the product the term belongs to, other users in your company might need something more like a dictionary with information on gender, plural forms or context examples.

If you want to provide terminology for translation, ask your translation vendors what format a list should have; maybe they already provide online access to their term base system and allow collaboration on terminology online.

If you want to provide terminology as a company dictionary through the intranet, talk to your webmaster about how a list can be brought online and, most importantly, how it can be updated periodically.

If you want to provide term lists for authors, ask them if they are using a term checking tool in their authoring environment and what a term list would need to look like to be easily importable.

Each of these scenarios differs in the way the term lists need to be set up and what kind of information (metadata for the term) needs to be added.

 

Angelika

(Trainer for translation tools since 1997)

Terminology work (3) – fundamental decisions

This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


Now let’s move on to more complex things – you need to decide what a term is.

If you take the view of a dictionary, a term is something that needs to be explained. But make no mistake, it is mostly the general words, words that everybody knows and seems to understand, that generate the most problems when creating or translating text. Words like cap, bolt, device etc. seem so general that you would not put them into a company dictionary and therefore also not into a terminology database. But these are exactly the terms that will produce most of the questions and misunderstandings.

Mostly, because they are used as the short form of a longer term. Instead of talking about a “multiple-output generating device” in every second sentence, you would probably use it once or twice and then shorten it to “device”. Everyone who reads the full text will see what you mean, but what if this text comes out of a content management system? A translator might get a small module to translate where the long form of the word is not to be found. How should the translator know which “device” exactly the text is talking about?

In this case, a good terminology database that lists the word “device” as the short form of one or several longer terms and gives some explanation of what it is and how it should be translated in different circumstances can help a lot.

When deciding on what a term is in your special case, try these categories:

  • Everything that has to do with your company and differs from other companies.
  • Everything that is special to your products and where a term differentiates between you and your competitors although you are producing the same thing (keep the term of the competitor as a forbidden term).
  • Things that are special to the subject matter area you work in.
  • Things that need an explanation (don’t forget the everyday words here).
  • Abbreviated forms, acronyms, slogans, mission statements.

And now we are at a point where terminology work can get messy and starts to grow uncontrollably.

In order to keep things manageable, limit the terminology collection to the source language and to one product (maybe the base product for others or the most used product). Once you have collected the most important terms here, you can move on to other products or other languages.

Angelika

(Trainer for translation tools since 1997)