aniketgade

Reputation: 93

Conflicts in the training data for Microsoft Custom Translator

I am using Microsoft Custom Translator and providing the training data in TMX format. My training data has some conflicts. For example, in my English-to-German training data I have duplicate English strings, but the German translations for these duplicates are different. How does this affect the model?

Upvotes: 1

Views: 117

Answers (2)

Adam Bittlingmayer

Reputation: 1277

I'll expand on the official and approved answer from our esteemed colleague at Microsoft Translator.

Yes, it happens a lot, and yes it will influence the probabilities in the resulting model.

Is that good? It depends.

Yes, some target-side conflicts are due to different contexts, especially for short strings, but just as often they have other causes and are unjustifiable inconsistencies.

It's best to actually look at the target-side conflicts and make an executive decision based on the type of the conflicts and the scenario - the overall dataset, the desired behaviour and the behaviour of the generic system.

There are cases where target-side conflicts in training data are desirable or harmless, but at least as often they're harmful or involve trade-offs.

For example, missing accent marks, bad encodings, nasty hidden characters or other non-human-readable differences like double-width parentheses, conflicting locales, untranslated segments, changing style guidelines... are mostly harmful conflicts. One variant could be localising units while the other does not. And, often enough, one variant is just a bad translation.
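For the non-human-readable cases, even a crude diagnostic helps. Here is a minimal sketch (the helper name is illustrative, not from any library) that checks whether two target variants that render identically actually differ only in Unicode normalization or whitespace:

```python
# Sketch: explain "invisible" differences between two target variants.
# The helper name explain_difference is illustrative.
import unicodedata

def explain_difference(a: str, b: str) -> list[str]:
    """Return human-readable notes on why two near-identical strings differ."""
    notes = []
    # e.g. a precomposed "é" vs. "e" + combining acute accent
    if unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b):
        notes.append("differ only in Unicode normalization form")
    # non-breaking space (U+00A0) vs. a regular space
    if a.replace("\u00a0", " ") == b.replace("\u00a0", " "):
        notes.append("differ only in non-breaking vs. regular spaces")
    # any whitespace difference at all (tabs, double spaces, ...)
    if "".join(a.split()) == "".join(b.split()):
        notes.append("differ only in whitespace")
    return notes
```

Two segments that come back with any of these notes are usually a data-pipeline bug, not a real alternative translation.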

Very often, these direct conflicts - that is, conflicts between segments with the exact same source, which can be found with a simple script - are a clue to conflicts in the wider dataset, which are harder to find unless you know what you're looking for.
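A minimal sketch of such a script, assuming a typical TMX 1.4 layout (`<tu>` units containing `<tuv xml:lang="...">` with `<seg>` children); the function name and language codes are placeholders:

```python
# Sketch: find target-side conflicts (same source, different targets) in a
# TMX file. Assumes the usual TMX structure; not a full TMX parser.
import xml.etree.ElementTree as ET
from collections import defaultdict

def find_conflicts(tmx_path, src_lang="en", tgt_lang="de"):
    """Return {source segment: set of distinct target segments} for every
    source that appears with more than one distinct target."""
    targets_by_source = defaultdict(set)
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # xml:lang lives in the XML namespace; fall back to plain "lang"
            lang = (tuv.get("{http://www.w3.org/XML/1998/namespace}lang")
                    or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # strip the locale suffix, e.g. "de-DE" -> "de"
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            targets_by_source[segs[src_lang]].add(segs[tgt_lang])
    return {s: t for s, t in targets_by_source.items() if len(t) > 1}
```

Eyeballing the output of something like this is usually enough to decide whether the conflicts are legitimate alternatives or pipeline bugs.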

Trade-offs exist between more 1:1 translationese and transcreation, between accuracy and fluency. The former has a bad name but it's less risky and more robust.

The decision could be to drop, resolve or to normalise, or to go debug the dataset and data pipeline.

Just throwing it all into the black box and mumbling In Deep Learning We Trust over Manning and Schütze 1999 three times only makes sense if the scale - the frequency with which you train custom models, not the amount of training data - is so high that basic due diligence is not feasible.

To really know, you may need to train the system with and without the conflicts, and evaluate and compare.

Source-side noise and conflicts, on the other hand, are not even really conflicts and are usually safe and even beneficial to include. And they're still worth peeking at.

Upvotes: 2

Chris Wendt

Reputation: 571

As long as one side is different, they are merely alternative translations, which happen all the time. The alternatives will be kept, and influence the probabilities in the resulting model.

Upvotes: 3
