High repetition ratio in software translations (Localization)

Thread poster: John Moran

John Moran

Ireland
Local time: 21:57
German to English

Nov 5, 2005

Hi all,

I am working on a job which has a very high ratio of repetitions and also the segments (sentences, fragments, strings, whatever you want to call them) are not contiguous so they do not need to be translated in sequence.

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment so I wrote a program in Java to store the segments in a MySQL database and then I reindexed the segments so that the files had no repetitions, just new words and fuzzy matches. The program is very rough and ready. I wrote it as a one-off (it took about 80 hours), and I am now wondering whether I should package it into a tool, as it really did save me a lot of time (more than 80 hours). This is not trivial and might take a year or so of my time, as I can only afford to work on it a couple of hours a day. At some point I guess it would be nice to see a return on that investment, but first I need to see if it is worth it, which brings me to my question.
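For readers who want to picture the reindexing step, here is a minimal in-memory sketch in Python. It is not the Java/MySQL program described above. The function names `reindex` and `expand` and the sample strings are made up for illustration, and it only handles exact repetitions, not fuzzy matches.

```python
def reindex(segments):
    """Collapse repeated segments.

    Returns the unique segments in first-seen order, plus a map from each
    segment to every position it occupied in the original file.
    """
    positions = {}
    for i, seg in enumerate(segments):
        positions.setdefault(seg, []).append(i)
    return list(positions), positions


def expand(translations, positions, total):
    """Put each translation back at every position its source occupied."""
    out = [None] * total
    for source, target in translations.items():
        for i in positions[source]:
            out[i] = target
    return out


# Hypothetical resource strings with many repetitions.
segments = ["File", "Open", "File", "Save", "Open", "File"]
unique, positions = reindex(segments)

# The translator only sees the three unique segments.
translations = {"File": "Datei", "Open": "Öffnen", "Save": "Speichern"}
restored = expand(translations, positions, len(segments))
```

The translator works on `unique` only; `expand` then rebuilds the full file, so the repetitions never have to be scrolled through.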

Are there any tools on the market which do what I just described (reindex resource string files so that there are no repetitions)?

I worked with Catalyst a few years ago and I don't remember it having this feature. I have a vague memory of LocStudio having a feature which put all the repeating segments into a single file so I maybe could have used that but I remember it being a very non-ergonomic environment to work in and I wanted to stick to using MSWord and Trados. I also remember it being expensive and it had a buggy interface to Trados but I think that has been improved.

I guess if I did decide to go ahead with it, this is what it would say on the box:

Features: Import all common resource string formats: .rc files, CSV, Excel, XML, tab-delimited etc.

Export into all common CAT tool formats, Trados, Logoport, whatever.

Benefits:

Time: Repetitions are reindexed so that the translator only sees each segment once.

QA: Arbitrary checks can be done on source and target strings, e.g. the number of & characters in the source and target should match, the target should not be longer than the source etc. etc.
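As a rough illustration of the two checks mentioned, a sketch in Python might look like this. The function name `qa_check` and the message texts are assumptions, not part of any existing tool.

```python
def qa_check(source, target):
    """Return a list of problems found for one source/target pair."""
    problems = []
    # Accelerator keys: the number of & characters should match.
    if source.count("&") != target.count("&"):
        problems.append("ampersand count differs")
    # Length limit: the target should not be longer than the source.
    if len(target) > len(source):
        problems.append("target longer than source")
    return problems


# Hypothetical pairs; each call returns the list of failed checks.
print(qa_check("&Open", "Auf"))
print(qa_check("&File", "&Datei"))
print(qa_check("&Save", "&Save"))
```

Further checks would be added as extra `if` blocks in the same function.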

Also, reports can be generated, for example how many segments were corrected by the proofreader for each translator working on the job.
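A report like that could be as simple as the following Python sketch. The row layout (translator, translated text, proofread text) is a hypothetical one chosen for the example.

```python
from collections import Counter


def corrections_per_translator(rows):
    """Count segments the proofreader changed, grouped by translator.

    rows: iterable of (translator, translated_text, proofread_text).
    """
    counts = Counter()
    for translator, translated, proofread in rows:
        if translated != proofread:
            counts[translator] += 1
    return counts


# Made-up sample data.
rows = [
    ("Anna", "Datei", "Datei"),
    ("Anna", "Offnen", "Öffnen"),
    ("Ben", "Sichern", "Speichern"),
    ("Ben", "Hilfe", "Hilfe"),
    ("Ben", "Schliessen", "Schließen"),
]
report = corrections_per_translator(rows)
```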

I don't really want to start on something that overlaps too much with an existing product in terms of functionality. I guess the main advantage of the tool is that it would let you work in whatever your favourite tool is (e.g. MSWord/Trados) but you still get the benefits of having the job managed by a database, reindexing, QA, reports etc.

Interestingly, getting rid of the repetitions let me calculate the exact return on investment down to the nearest cent.

I probably will not go ahead with it because I am busy with more immediate work but if there is interest in this I would like to hear it before I ditch the idea.

In particular I would be interested to hear from three groups of people.

A) People who have experience managing or engineering large software localization jobs and specifically jobs where the repetitions were a high ratio of total word count. If a job only has 10% reps this tool does not add much value but if it is 90% it does.

B) Anyone who thinks what I just described overlaps with an existing tool/package.

C) Anyone who thinks they would like to collaborate. I am open to talking about any ideas, even open source or partnership.

Sorry about the long post. Hope it wasn't too techie

John

p.s. If anyone wants to talk privately, my address is
[email protected]

(without the minus signs!)

John Moran

Ireland
Local time: 21:57
German to English

TOPIC STARTER

RC WinTrans

Nov 5, 2005

Hi again!

Replying to my own post. Classy. Just had a look at the latest version of RCWinTrans. It seems to have a lot of what I described.

I guess the main difference is that whoever is working with RCWinTrans has to work out how to use it, whereas with my tool the other translators on the job only had to use MSWord and Trados, but it definitely overlaps.

Cheers,

John

Jaroslaw Michalak

Poland
Local time: 22:57
Member (2004)
English to Polish

SITE LOCALIZER

Trados?

Nov 5, 2005

You can analyze the files to be translated with Trados and export only unknown segments to a txt or rtf file. Then you have a file with "clean" segments, without external untranslated segments and repetitions. I use that technique a lot, in fact.

pcovs
Denmark
Local time: 22:57
English to Danish

What about the outcome?

Nov 5, 2005

I'm sorry to ask such a stupid question, but when you "export" these non-translated segments and you then translate in this new document, what will the document for delivery look like?

How do you deliver a document that looks like the original document (only translated) containing all repetitions etc. translated to the client?

Harry Bornemann

Mexico
Local time: 14:57
English to German

Access & Perl?

Nov 5, 2005

To get rid of too many repetitions I usually apply an Access query with the "group by" function - very quick and simple.
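The same "group by" idea can be tried without Access. The sketch below uses Python's built-in `sqlite3` module instead; the table and column names (`segments`, `source`) and the sample rows are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database standing in for the Access table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE segments (id INTEGER PRIMARY KEY, source TEXT)")
conn.executemany(
    "INSERT INTO segments (source) VALUES (?)",
    [("File",), ("Open",), ("File",), ("Save",), ("Open",)],
)

# One row per distinct segment, with its count and first position.
rows = conn.execute(
    "SELECT source, COUNT(*) AS occurrences, MIN(id) AS first_id "
    "FROM segments GROUP BY source ORDER BY first_id"
).fetchall()

for source, occurrences, first_id in rows:
    print(source, occurrences, first_id)
```

Ordering by the first id keeps the unique segments in the order they first appear in the file.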

To import/export between various formats I use Perl, which has an excellent parsing functionality ("regular expressions"). I think it would be difficult to keep up to date with all of the formats of the newest versions of the most common CAT tools (have you considered e.g. POT files? - they have been used for my second largest project), so I write a new Perl script...

Jaroslaw Michalak

Poland
Local time: 22:57
Member (2004)
English to Polish

SITE LOCALIZER

It's a good question...

Nov 5, 2005

PCovs wrote:

I'm sorry to ask such a stupid question, but when you "export" these non-translated segments and you then translate in this new document, what will the document for delivery look like?

How do you deliver a document that looks like the original document (only translated) containing all repetitions etc. translated to the client?

I did not explain the procedure in detail...

1. I convert the original files with the appropriate tools to a format which can be processed by Trados. This still can be ugly - lots of strings not to be translated (but visible), repetitions, tags, etc.

2. I analyze and export the simple segments to rtf.

3. I translate the temporary file so that I have all the segments needed in the TM (the file itself is not used - it is just a "tool" to get the translation in TM).

4. I translate the converted source file automatically with Trados.

5. I convert the translated file back to the original format.
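Steps 2 to 4 can be pictured with a toy translation memory. The Python sketch below uses a plain dictionary as the TM and exact matching only; it does not model how Trados itself analyzes or pretranslates, and the names `export_unknown` and `pretranslate` are made up.

```python
def export_unknown(segments, tm):
    """Step 2: unique segments that the TM cannot translate yet."""
    seen = set()
    unknown = []
    for seg in segments:
        if seg not in tm and seg not in seen:
            seen.add(seg)
            unknown.append(seg)
    return unknown


def pretranslate(segments, tm):
    """Step 4: fill every segment from the TM; misses stay untranslated."""
    return [tm.get(seg, seg) for seg in segments]


# Hypothetical converted source file and existing TM.
segments = ["OK", "Cancel", "OK", "Help"]
tm = {"Cancel": "Anuluj"}

unknown = export_unknown(segments, tm)

# Step 3: translating the temporary file only serves to fill the TM.
tm.update({"OK": "OK", "Help": "Pomoc"})

translated = pretranslate(segments, tm)
```

The temporary list `unknown` is thrown away afterwards; only the TM entries it produced are used to translate the real file.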

Trados is not perfect, unfortunately, so the automatic translation needs to be checked. I don't mind, as I still would check the final file anyway.

Of course, if you use SDLX etc. the result is quite similar, but I am most comfortable with Word.

PatriziaM.

Italy
Local time: 22:57
English to Italian

DVX populates automatically

Nov 6, 2005

John Moran wrote:

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment so I wrote a program in Java to store the segments in a MySQL database and then I reindexed the segments so that the files had no repetitions, just new words and fuzzy matches.

Hi!
Do you know Déjà Vu X? It includes a function that automatically populates all repetitions (that is, Déjà Vu enters the translation automatically) once you have translated the first one manually. It seems very similar to what you describe, doesn't it?

Rodolfo Raya

Local time: 17:57
English to Spanish

Heartsome XLIFF Editor also auto-propagates repetitions

Nov 6, 2005

John Moran wrote:

I normally work with Trados but I did not want to have to scroll through 100 repetitions of the same segment...

Once you translate a segment in Heartsome XLIFF Editor (see http://www.heartsome.net ) your translation is automatically copied to all identical segments. Fuzzy matches are also automatically added to segments with a similarity above a user selected threshold.
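To show what a similarity threshold means in practice, here is a small Python sketch using `difflib.SequenceMatcher` from the standard library. Its ratio is not the score Heartsome or any other CAT tool computes, and the function name `fuzzy_matches` and the default threshold of 0.75 are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_matches(segment, tm, threshold=0.75):
    """Return (score, source, target) for TM entries above the threshold.

    Results are sorted best match first. A score of 1.0 is an exact match.
    """
    matches = []
    for source, target in tm.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold:
            matches.append((score, source, target))
    return sorted(matches, reverse=True)


# Made-up TM entries.
tm = {
    "Open the file": "Abrir el archivo",
    "Close window": "Cerrar ventana",
}
results = fuzzy_matches("Open the files", tm)
for score, source, target in results:
    print(f"{score:.2f}  {source} -> {target}")
```

Raising the threshold returns fewer, closer suggestions; lowering it returns more, weaker ones.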

The statistics provided by the XLIFF Editor already account for repetitions, so you don't need to alter the source document to prepare a quote.

It doesn't matter if you have 90% repetitions. CAT tools are designed to save time and reuse data from the TM, especially exact matches.

Regards,
Rodolfo

pcovs
Denmark
Local time: 22:57
English to Danish

I see, but why not simply use 'translate to fuzzy'?

Nov 6, 2005

That's what I usually do, but obviously if it's a very large file, it may take some time.

But the extracting etc. also takes time, so I guess it would be a tool to be used only with very large files with a lot of repetitions?

Samuel Murray

Netherlands
Local time: 22:57
Member (2006)
English to Afrikaans

I think Wordfast has it...

Nov 6, 2005

John Moran wrote:
Are there any tools on the market which do what I just described (reindex resource string files so that there are no repetitions)?

Wordfast's extract tool extracts all non-unique segments and saves them in a single file (it also extracts all segments, but those are saved in another file). I'm not sure if the extraction count is also subject to the non-registration TM limit (but try it anyway).

I don't really want to start on something that overlaps too much with an existing product in terms of functionality.

Allow me to comment on the features, then.

Features: Import all common resource string formats: .rc files, CSV, Excel, XML, tab-delimited etc.

Wordfast can do all of this (although some documents need to be tagged). Except perhaps .rc files... I always get confused with .rc and .res -- Wordfast can import the non-binary one of the two.

Export into all common CAT tool formats, Trados, Logoport, whatever.

Wouldn't it be easier to simply export to TMX?
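For anyone who wants to try the TMX route, a minimal writer can be sketched in Python with the standard library. The header values below (`creationtool`, `o-tmf`, `datatype` and so on) are placeholder assumptions, and a real exchange would need to be checked against the TMX version the receiving tool expects.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX 1.4 document from (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "reindex-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "none",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Made-up translation units.
tmx_text = build_tmx([("&File", "&Datei"), ("Open", "Öffnen")], "en", "de")
print(tmx_text)
```

Special characters such as `&` are escaped by the XML writer, so accelerator keys survive the round trip.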

Repetitions are reindexed so that the translator only sees each segment once.

Wordfast can extract segments. Even if it couldn't, it can still perform a dummy auto translation with source=target, and then you just remove duplicates from the TM, and remove all columns except the source text column.

QA: Arbitrary checks can be done on source and target strings, e.g. the number of & characters in the source and target should match...

I *think* you can try to define the ampersand as a placeable in Wordfast, then Wordfast will QA check to see if the number of placeables match.
