Reputation: 3574
I have multi-language text that contains a message translated to several languages. For example:
English message
Russian message
Ukrainian message
The order is not exact. I would like to devise some kind of supervised or unsupervised learning algorithm to do the segmentation automatically and extract each translation, in order to build a parallel corpus.
Could you suggest any papers or approaches? I cannot come up with the proper keywords for googling.
Upvotes: 1
Views: 85
Reputation: 122052
Why don't you try some language identification software? Such tools typically report > 90% accuracy.
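For this specific three-language case you can get surprisingly far without any trained identifier at all. Here is a minimal stdlib sketch that guesses the language of each line from its Unicode characters; the letters і, ї, є, ґ are Cyrillic but occur in Ukrainian and not Russian, which is the assumption this toy heuristic rests on. A real system should use proper language-identification software instead.

```python
import unicodedata

# Cyrillic letters used in Ukrainian but not in Russian orthography
# (assumption this toy heuristic relies on).
UKRAINIAN_ONLY = set("іїєґІЇЄҐ")

def guess_language(line):
    """Crude per-line guess among English / Russian / Ukrainian."""
    if any(ch in UKRAINIAN_ONLY for ch in line):
        return "uk"
    if any("CYRILLIC" in unicodedata.name(ch, "") for ch in line):
        return "ru"
    if any(ch.isalpha() and ch.isascii() for ch in line):
        return "en"
    return "unknown"

doc = ["hello world", "привет мир", "привіт світ"]
print([guess_language(line) for line in doc])  # ['en', 'ru', 'uk']
```

Once every line carries a language tag, grouping the lines into per-language segments (and hence aligned translations) is straightforward.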
Upvotes: 1
Reputation: 4106
The most basic approach to your problem would be to build a bag-of-words representation of your document. In short, a bag of words is a matrix where each row is a line of your document and each column a distinct term.
For instance, if your document looks like this:
hello world
привет мир
привіт світ
You will get this matrix:
   | hello | world | привет | мир | привіт | світ
l1 |   1   |   1   |   0    |  0  |   0    |  0
l2 |   0   |   0   |   1    |  1  |   0    |  0
l3 |   0   |   0   |   0    |  0  |   1    |  1
You can then apply clustering algorithms (such as k-means, if you have no labels) or classifiers (such as SVMs, if you do) according to your needs.
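The matrix above can be built with the standard library alone; this is a hedged sketch (real pipelines would typically use something like scikit-learn's `CountVectorizer`), and the whitespace tokenizer is a simplifying assumption:

```python
from collections import Counter

doc = ["hello world", "привет мир", "привіт світ"]

# Tokenize each line and collect the vocabulary (the matrix columns).
tokenized = [line.split() for line in doc]
vocab = sorted({term for terms in tokenized for term in terms})

def bag_of_words(lines, vocab):
    """Return one row of term counts per line, columns ordered by vocab."""
    rows = []
    for terms in lines:
        counts = Counter(terms)
        rows.append([counts[term] for term in vocab])
    return rows

matrix = bag_of_words(tokenized, vocab)
for label, row in zip(["l1", "l2", "l3"], matrix):
    print(label, row)
```

Because translations of the same message share almost no surface terms, the rows for different languages end up nearly orthogonal, which is exactly what makes clustering them easy.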
For more details, I would suggest reading this paper, which provides a great summary of techniques.
Regarding keywords for googling, I would say text analysis, text mining, or information retrieval are a good start.
Upvotes: 2