Word Count Differences (2)

Posted on Leave a commentPosted in Uncategorized

How can word counts differ within the same tool on different machines?

Have you ever run a word count with the same document on two different machines and received different word counts?

Well, here is what can have an impact on the word count statistics:

  • The use of a TM on one machine and no TM on the other can produce different word counts. A project with no TM uses the default counting settings, which may have been adjusted in the TM you actually use, for example the setting that determines whether hyphenated words count as one word or two.

Example: The same file in the same project gets analyzed without a TM and with a TM where the default settings had been adjusted (and here even the number of segments and characters changes).

(Screenshots not reproduced here: analysis results without TM and with TM.)

  • The filters you use to import the file have different settings. If a filter includes or excludes hidden text, hidden layers, comments, hidden rows or columns, embedded objects etc., this can have a big impact on the number of words that are counted. I remember one time when a Word document that visibly had only a few words produced a very large word count, because the content of an embedded Excel file was extracted on one machine but not on the other.

Example: The same file (just with different names) was imported with the default XML filter and with a filter that also imports the content of an attribute for translation.
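The effect of such a filter setting can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample XML, the `alt` attribute and the function names are made up, and a real import filter does far more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample file: one attribute with text, one element with text.
SAMPLE = '<page><img alt="Company logo in blue"/><p>Welcome to our site</p></page>'


def extract_text(xml_source, translatable_attributes=()):
    """Collect element text and, optionally, the values of selected attributes."""
    root = ET.fromstring(xml_source)
    parts = []
    for element in root.iter():
        if element.text and element.text.strip():
            parts.append(element.text.strip())
        for name in translatable_attributes:
            if name in element.attrib:
                parts.append(element.attrib[name])
    return parts


def count_words(parts):
    """Naive count: split every extracted string on whitespace."""
    return sum(len(part.split()) for part in parts)


default_count = count_words(extract_text(SAMPLE))
with_alt_count = count_words(extract_text(SAMPLE, translatable_attributes=("alt",)))
print(default_count, with_alt_count)  # 4 8
```

The file is identical in both runs; only the filter setting changes, and the word count doubles.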

  • The use of different versions of the software. Believe it or not, the tool providers do tweak the way words are counted now and then. At one point, a number-measurement combination was counted as two words in one Trados version but as one word in the next. It took some time to figure that one out, believe me, as it was unfortunately not mentioned in the release notes.
  • The analysis settings you use. The analysis might have an option to ignore locked segments. If that option is switched off on one machine but switched on on the other, the word counts will differ as well (provided, of course, that there are locked segments in the files, for example in XLIFF files from another tool, or if you run an analysis after file preparation).

Word Count Differences


How can it be that the word count for the same file differs from (translation) tool to (translation) tool?

The way a translation tool counts words can differ from any other translation tool as well as from the word count in Word. The reason is the way words and word boundaries are defined in the tools. Some specify that a word with a hyphen (like “tool-related”) should be counted as one word, others see it as two words. The same is true for other delimiting characters, like slashes (/) or apostrophes (’). It can even happen that a character like a slash, if it is surrounded by spaces (as in “in / out”), is counted as a word on its own in one tool, but not at all in another.

Some tools recognize combinations of letters and numbers (alphanumeric items) as one word, but only as long as there is no slash or hyphen that separates numbers from letters (ABC123 = 1 word, but ABC-123 = 2 words).
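To make the hyphen rule tangible, here is a minimal sketch of a counter with a "hyphen splits words" switch. The function name and the switch are assumptions for illustration; no actual tool is claimed to count exactly like this.

```python
import re


def count_words(text, split_on_hyphen=True):
    """Count words, optionally treating the hyphen as a word delimiter."""
    # Whitespace and slashes always separate words in this sketch.
    delimiters = r"[\s/\-]+" if split_on_hyphen else r"[\s/]+"
    return len([token for token in re.split(delimiters, text) if token])


print(count_words("ABC123"))                          # 1
print(count_words("ABC-123"))                         # 2
print(count_words("ABC-123", split_on_hyphen=False))  # 1
print(count_words("tool-related settings"))           # 3
```

Flipping one switch changes the result for every hyphenated item in a file, which is why the same document can produce different totals.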

Depending on the types of elements your file contains, the difference can be quite extensive. A recent example from a file preparation showed elements like these:

/content/legal/privacy?cid=cookieprivacy#cookies-policy

One tool counted that whole expression as 1 word, the other counted 4 words, using the slashes and the equal symbol as word delimiters. Imagine the word count difference if there are 1000 items like this one in the file.
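The two results can be reproduced with a small sketch. The two counting functions are assumptions that mimic the behavior described above; they are not the actual algorithms of either tool.

```python
import re

EXPRESSION = "/content/legal/privacy?cid=cookieprivacy#cookies-policy"


def count_whitespace_only(text):
    """Tool-A-like behavior: only whitespace separates words."""
    return len(text.split())


def count_with_delimiters(text, delimiters="/="):
    """Tool-B-like behavior: the given characters also separate words."""
    pattern = "[" + re.escape(delimiters) + r"\s]+"
    return len([token for token in re.split(pattern, text) if token])


one_word = count_whitespace_only(EXPRESSION)
four_words = count_with_delimiters(EXPRESSION)
print(one_word, four_words)             # 1 4
print((four_words - one_word) * 1000)   # 3000 words difference for 1000 items
```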

Of course, it is debatable which part of the above expression needs to be translated, if any. That would be a nice exercise for the use of regular expressions, either to tag the whole thing or to extract the translatable part. 🙂

And although some tools let you influence the way they count by providing checkboxes to specify words with hyphens as one or two words, it is almost impossible to achieve the exact same word count with any two tools when your documents contain delimiting characters like slashes or equal symbols.

And here is the real-life comparison over 39 files:

(Screenshots not reproduced here: results from analysis tool A and analysis tool B.)

Note that the segment count is quite close but the word count is very different.

REGEX – the hidden language for our translation tools


Our tools offer a lot of functionality, but in many places a knowledge of some simple regex (regular expressions) can enhance these functionalities a lot.

  • You can create your own filters, i.e. determine what parts of a (text-based) document get imported for translation.
  • You can convert text elements into tags, which is especially useful for placeholders like these {1}, ##NAME## or %sd.
  • You can use regex to search for a pattern, like a web address, a date, a combination of number and measurement…
  • You can use it to run a replace action (changing date formats or the sequence of elements, like 25% -> % 25).
  • You can use regex in the QA checkers to find specific things, like numbers and measurement units that are not separated by a non-breaking space.
  • You can use regex when specifying segmentation rules and segmentation exceptions.
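A few of the uses listed above can be sketched in Python. The patterns, the `<ph>` tag name and the unit list are assumptions for illustration; each translation tool has its own regex flavor and its own place to enter such expressions.

```python
import re

# Placeholders like {1}, ##NAME## or %s / %d that should become tags.
PLACEHOLDER = re.compile(r"\{\d+\}|##[A-Z]+##|%[sd]")

# A number followed by a unit with a normal space or no space at all.
# A non-breaking space (\u00a0) between them is not matched, so it passes.
NUMBER_UNIT_WITHOUT_NBSP = re.compile(r"\d+ ?(?:mm|cm|kg)\b")


def protect_placeholders(text):
    """Wrap each placeholder in a (hypothetical) <ph> tag."""
    return PLACEHOLDER.sub(lambda m: f"<ph>{m.group(0)}</ph>", text)


def move_percent_sign(text):
    """Replace action: turn '25%' into '% 25'."""
    return re.sub(r"(\d+)%", r"% \1", text)


print(protect_placeholders("Hello ##NAME##, you have {1} new messages."))
print(move_percent_sign("A discount of 25%"))
print(NUMBER_UNIT_WITHOUT_NBSP.findall("Use 5 mm or 12kg, not 7\u00a0cm."))
```

The last line prints only the two problem cases, `5 mm` and `12kg`, which is how a QA check would use such a pattern.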

We all know that a good preparation at the beginning of a project can save a lot of repair work (in all the target languages) and regex is definitely a good thing to include into your preparation considerations.

 

I often get asked where there is material to learn how to do regex. Well, there are a lot of very good tutorials on the internet (just use the search words “regex” and “tutorial”). But none of them focuses on the needs of the translation industry (hence my course on Regex for Translation on L10Ntrain).

 

From my experience, the regex you need to know starts with these few expressions:

  • Brackets and what they do: ( ) for grouping, [ ] for character ranges/lists and { , } for the minimum and maximum number of repetitions of the preceding element.
  • Characters that have their own meaning in regex and need the backslash (escape character) before them when you want to search for the actual character:
    • Dot (.), plus (+), asterisk (*), dollar ($), circumflex (^), backslash (\)
  • Searching for spaces in general: \s
  • Searching for digits: \d or [0-9]
  • Searching for letters: [a-z], [A-Z], \p{Lu}…
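The expressions above can be tried out quickly in Python. The sample strings are made up. Note that Python's built-in `re` module does not support `\p{Lu}`, so this sketch uses `[A-Z]` instead; many translation tools use a regex flavor that does accept `\p{Lu}`.

```python
import re

# ( ) for grouping, {n} for an exact count, \. for a literal dot:
# reorder a date from DD.MM.YYYY to YYYY-MM-DD.
date = re.sub(r"(\d{2})\.(\d{2})\.(\d{4})", r"\3-\2-\1", "Released on 24.12.2023")
print(date)  # Released on 2023-12-24

# [ ] for a character range, {min,max} for 2 to 4 repetitions:
# find short all-caps abbreviations.
acronyms = re.findall(r"\b[A-Z]{2,4}\b", "The TM and the XLIFF file use UTF encoding")
print(acronyms)  # ['TM', 'UTF']

# \s for any whitespace: collapse runs of spaces and tabs.
cleaned = re.sub(r"\s+", " ", "too   many\tspaces")
print(cleaned)  # too many spaces

# Escaping $ and + to search for the actual characters.
prices = re.findall(r"\$\d+\+", "Plans: $10+ and $25+")
print(prices)  # ['$10+', '$25+']
```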

 

Of course, there are many more and you can do wonderful things with regex, but these few can get you started quite quickly.

For an overview and more examples, check out the introductory course on Regular Expressions in Translation 🙂

How would you calculate a repetition in the proposal/invoice?


This question comes up right after explaining what a repetition really is and it is not easily answered.

Technically, once the first occurrence of a repeated segment has been translated and saved to the TM, all other occurrences of this segment will appear as 100% matches to the translator. So, they could be invoiced the same way as 100% matches.

BUT, depending on the type of text you have to translate or the language pair you deal with, this might not be the case. For example, a catalog that is translated into German might deal with “gearboxes”, and in German the singular and plural of this word are identical (Getriebe). This means that not all occurrences of a repeated segment (maybe the headings in a table) have to be translated in the same way. Or, taking German as a target language again, one and the same English sentence can have 3 different translations, depending on the grammatical gender of the object it refers to. Example: Connect the one with the other. (Admittedly, that is not good style in English, but it happens 🙂 ).

 

This also brings up the question whether a 100% match can be left unchecked (as many clients seem to think). It is a 100% match after all, so it was there before, it is in the TM, it has been translated and paid for already… The thing is, a 100% match only tells you that the SOURCE segment has appeared before. It does not mean that the translation is complete or correct, or that it fits the context. That is why translation vendors will usually tell you that even 100% matches should be checked for correctness in context.

Localization Tools – What is a Repetition?


Translation tools are easy enough to get started with, but there are many settings and features that are not so self-explanatory. One of these seems to be the definition for a repetition when doing an analysis of translation documents. When I ask what a repetition is, I get answers ranging from “when words are repeated” to “all sentences that are similar” to “segments that repeat” – where only the last one is partially true.

Here is the definition of a repetition from the tools I have dealt with so far: A repetition is a segment that comes up repeatedly (either inside a document or between documents) and DOES NOT have a 100% match from the TM. This last part is important. If it had a 100% match, it would be counted as such. So if a segment appears 5 times in exactly the same way AND has a 100% match from the TM, it would be counted as five 100% matches (usually, unless there is a setting to change this type of counting, but that is going too far here 🙂 ).

If it does not have a 100% match, it is counted as a “no match” or “fuzzy match” the first time around. The second time is counted as the first repetition and so forth.
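The counting logic described above can be sketched as follows. This is a simplification under assumptions: fuzzy matching is left out entirely (everything that is not in the TM is a "no match" the first time), the TM is just a set of source segments, and the function name is made up.

```python
from collections import Counter


def analyze(segments, tm_sources):
    """Classify each segment as 100% match, no match or repetition."""
    stats = Counter()
    seen = set()  # segments without a TM match that already occurred
    for segment in segments:
        if segment in tm_sources:
            # A TM hit always wins, however often the segment repeats.
            stats["100% match"] += 1
        elif segment in seen:
            stats["repetition"] += 1
        else:
            stats["no match"] += 1
            seen.add(segment)
    return stats


segments = ["Press OK.", "Press OK.", "Press OK.", "Close the lid.", "Close the lid."]
tm_sources = {"Close the lid."}
result = analyze(segments, tm_sources)
print(dict(result))  # {'no match': 1, 'repetition': 2, '100% match': 2}
```

"Press OK." appears three times without a TM match, so it is one no match plus two repetitions. "Close the lid." appears twice with a TM match, so it is two 100% matches and no repetition.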

Terminology work (5) – extraction


This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


The terminology you are using appears in any written text, be it website pages, brochures, manuals, guidelines, contracts, reports…

Someone will have to read all that text, decide what is a term and extract (copy/paste) it. This can be helped by tools, but make no mistake, the tools are not intelligent. Most terminology extraction tools work on a statistical basis – the more often a term appears, the more important it is. That is not always the case. An important term might come up only twice, once in the heading and once in the first paragraph, and afterwards it might be referred to by its short form. In this case, most statistical tools would not extract the term, as it appears fewer than 5 times.
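A purely frequency-based extraction can be sketched like this. It is a deliberately naive illustration: single words only, no stop word list, and the threshold of 5 is taken from the example above. The function name and the sample text are made up.

```python
import re
from collections import Counter


def extract_candidates(text, min_frequency=5):
    """Return every word that occurs at least min_frequency times."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_frequency}


# The long term appears twice, the short form takes over afterwards.
text = "Multiplexer overview. The multiplexer is a device. " + "The device starts. " * 5
candidates = extract_candidates(text)
print(candidates)  # {'the': 6, 'device': 6, 'starts': 5}
```

The important term "multiplexer" is missed because it occurs only twice, while a filler word like "the" makes it onto the list. Both effects are why the resulting list still has to be checked by a person.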

There are linguistic extraction tools, but they are limited to the languages they were built for and are not available for all language pairs. At least they can be configured, for example, to extract noun phrases of up to 4 words, which are usually good candidates for a term list. Statistical tools will create a huge list of possible terms, but then this list needs to be checked for the real terms.

From my experience (mostly extractions from English and German technical and medical documents), there is a threshold above which extraction with a tool makes more sense. I found that up to 20,000 words of text, it does not really make a difference whether you read through the text sentence by sentence and select the terms manually, or run a statistical extraction tool and then go through the list and mark the terms you want to keep. Above that, the extraction with a tool is faster.

Most translation tools will have a component that allows the extraction of terms and can be used both for monolingual (usually source language) material and also for bilingual material, i.e. translation memories, bilingual files from the translation process or alignments of files.

To estimate how much terminology can be extracted, I usually calculate with about 20% of the terms of a list extracted by a tool or between 5 and 15% of the overall word count of the document(s), depending on whether they are more general or more technical in nature.
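As a quick worked example of this rule of thumb, the two estimates can be written as small helper functions. The function names and the sample figures (a list of 3,000 candidates, a document of 20,000 words) are assumptions for illustration.

```python
def estimate_from_candidate_list(candidate_count, keep_ratio=0.20):
    """About 20% of a tool-extracted candidate list are real terms."""
    return round(candidate_count * keep_ratio)


def estimate_from_word_count(word_count, low=0.05, high=0.15):
    """Between 5% and 15% of the overall word count, as a (low, high) range."""
    return round(word_count * low), round(word_count * high)


print(estimate_from_candidate_list(3000))  # 600
print(estimate_from_word_count(20000))     # (1000, 3000)
```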

When extracting terms, make sure you have defined what kind of terms you are looking for (see part 3 of this series: Terminology work (3) – fundamental decisions).

 

Angelika

(Trainer for translation tools since 1997)

Terminology work (4) – fundamental decisions about the user


This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


Most terminology work starts life in Excel – which is a very good way to get started, but not something you would use for professional terminology management.

Usually, when you start to think of terminology work, you already have a goal in mind or a pain point that needs your attention.

  • Recurring questions from translators – you want to provide them with a term list or term base that can be used in the translation tool (for terminology recognition and terminology checking)
  • Support tickets because users misunderstand the product or process description
  • Company-internal effort to check translated documentation for the correct terminology
  • You want to provide the company terminology to all users in the company through the intranet
  • You want to provide terminology lists to the authors

Depending on the intended user group, the information associated with each term can be different. Whereas a translator needs to know the term, the translation, any forbidden alternatives and the product the term belongs to, other users in your company might need something more like a dictionary with information on gender, plural forms or context examples.
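What a translator-oriented entry might look like as a data structure is sketched below (Python 3.9 or later). The field names, the sample entry and the product name are assumptions, not the format of any particular term base system, and the check is a naive substring search.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """The information a translator needs, as listed above."""
    term: str
    translation: str
    product: str
    forbidden: list[str] = field(default_factory=list)


def find_forbidden_terms(text, entries):
    """Return (forbidden term, preferred term) pairs found in the text."""
    lowered = text.lower()
    hits = []
    for entry in entries:
        for bad in entry.forbidden:
            if bad.lower() in lowered:
                hits.append((bad, entry.term))
    return hits


entries = [
    TermEntry("gearbox", "Getriebe", "Drive Unit X", forbidden=["transmission box"]),
]
hits = find_forbidden_terms("Open the Transmission Box carefully.", entries)
print(hits)  # [('transmission box', 'gearbox')]
```

A company dictionary for other users would add different fields to the same entry, for example gender, plural forms or context examples.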

If you want to provide terminology for translation, ask your translation vendors what format a list should have. Maybe they already provide online access to their term base system and allow collaboration on terminology online.

If you want to provide terminology as a company dictionary through the intranet, talk to your webmaster about how a list can be brought online and, most importantly, how it can be updated periodically.

If you want to provide term lists for authors, ask them whether they are using a term checking tool in their authoring environment and what a term list would need to look like to be easily importable.

Each of these scenarios differs in the way the term lists need to be set up and in what kind of information (metadata for the term) needs to be added.

 

Angelika

(Trainer for translation tools since 1997)

Terminology work (3) – fundamental decisions


This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


Now let’s move on to more complex things – you need to decide what a term is.

If you take the view of a dictionary, a term is something that needs to be explained. But make no mistake, it is mostly the general words, the words that everybody knows and seems to understand, that generate most of the problems when creating or translating text. Words like cap, bolt, device etc. seem to be so general that you would not put them into a company dictionary and therefore also not into a terminology database. But these are exactly the terms that will produce most of the questions and misunderstandings.

Mostly, because they are used as the short form of a longer term. Instead of talking about a “multiple-output generating device” in every second sentence, you would probably use it once or twice and then shorten it to “device”. Everyone who reads the text will see what you mean – but what if this text comes out of a content management system? A translator might get a small module to translate where the long form of the word is not to be found – how should the translator know exactly what “device” the text is talking about?

In this case, a good terminology database can help a lot if it lists the word “device” as the short form of one or several longer terms and explains what it is and how it should be translated in different circumstances.

When deciding on what a term is in your special case, try these categories:

  • Everything that has to do with your company and differs from other companies.
  • Everything that is special to your products and where a term differentiates between you and your competitors although you are producing the same thing (keep the term of the competitor as a forbidden term).
  • Things that are special to the subject matter area you work in.
  • Things that need an explanation (don’t forget the everyday words here)
  • Abbreviated forms, acronyms, slogans, mission statements

And now we are at a point where terminology work can get messy and starts to grow uncontrollably.

In order to keep things manageable, limit the terminology collection to the source language and to one product (maybe the base product for others or the most used product). Once you have collected the most important terms here, you can move on to other products or other languages.

Angelika

(Trainer for translation tools since 1997)

Terminology work (2) – how to get started (continued)


This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


In addition to product names, company names and abbreviations, you probably have a lot of other lists with things that can be considered terminology.

How about…

  • Lists of products or product categories on your website?
  • Lists of trademarks, trade names and maybe even some definition or explanation with it
  • Images in manuals or brochures with associated parts lists (maybe already bilingual or multilingual)
  • Lists of job titles (for e-mail signatures, business cards…) and job descriptions
  • Lists of acronyms (abbreviated forms in capital letters, like OSW or MF’s)
  • Glossaries on your website or in user/training manuals
  • Table of contents and index of larger documents

And once you have collected all the stuff that is used and should be used, don’t forget all the terms and expressions that should NOT be used…

  • Because they are used by a competitor
  • Because the term is outmoded/outdated
  • Because the term should not be used any longer after a merger of companies

Next, check the feedback on social media and ask people on the support hotline or in the legal department which terms or phrases have drawn comments, complaints or help requests – these are the terms that definitely need to be explained and defined and need to go into your term lists.

Angelika

(Trainer for translation tools since 1997)

Terminology work (1) – how to get started


This series offers some insights from the many workshops and presentations on terminology that I have done over the years.


Don’t be afraid of terminology work!

Yes, there are many things you can do with terminology, but you can also start small and build upon it when time and resources permit.

How would you get started?

The most obvious thing is to collect what is already there.

A list of product and company names

 

  • Decide on the spelling that you want to use.
  • Decide if and when any of these names needs to be different in one of your target markets (for example in countries with different alphabets or Asian countries that use characters rather than letters).
  • Make sure everybody in the company knows about this list and that translators have access to it as well.
Now build upon this list
  • Think about how product names are created in your company. Is there a pattern? Should there be a pattern?
  • How do you make sure that everybody uses the product names correctly? Are there checks for the source text authors and translators in place?
  • Do you discuss new product or company names with target language experts who can tell you if the proposed name might have any issues or unintended meanings in that language?
  • Make sure that whoever wants to change one of these names knows that they will have to shoulder the cost of changing it in all documents and all languages.

 

A list of abbreviations and their meanings

Everyone in the company will have a list or post-it or file that lists some of the company-specific abbreviations and their meaning.

  • Collect these lists
  • Award the person who comes up with the longest list
Check the list
  • Make sure that the combination of abbreviation and long form of the word is accurate.
  • Make sure everybody in the company knows about this list and that translators have access to it as well – they will be especially thankful as this list can help the translation tools to recognize better where a sentence ends (i.e. NOT at the dot of an abbreviation).
Now build upon this list
  • Think about how abbreviations are created in your company. Is there a pattern? Should there be a pattern?
  • How do you make sure that everybody uses the abbreviations correctly? Are there checks for the source text authors and translators in place?
  • Make sure that whoever wants to change one of these abbreviations knows that they will have to shoulder the cost of changing it in all documents and all languages.
  • Talk to your translation vendors and create the list in such a way that it can be easily imported into the term base components of the translation tools.
  • See if the lists can also be used within content management or authoring tools to help the authors.
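Why an abbreviation list helps the translation tools find the end of a sentence can be shown with a small sketch. The splitting rule here (sentence end = punctuation followed by whitespace) and the function name are assumptions; real tools use configurable segmentation rules and exceptions.

```python
import re


def split_sentences(text, abbreviations=frozenset()):
    """Split after . ! ? plus whitespace, except after a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in abbreviations:
            continue  # the dot belongs to an abbreviation, keep going
        sentences.append(candidate.strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


text = "Use approx. 5 ml of oil. Close the lid."
without_list = split_sentences(text)
with_list = split_sentences(text, abbreviations={"approx."})
print(without_list)  # ['Use approx.', '5 ml of oil.', 'Close the lid.']
print(with_list)     # ['Use approx. 5 ml of oil.', 'Close the lid.']
```

Without the list, the first sentence is cut in two after the abbreviation, which also changes the segment count of the file.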

 

These things sound obvious, don’t they? But you would be surprised how often this is one of the last steps when people talk about terminology management.

Angelika

(Trainer for translation tools since 1997)