Reputation: 3574
I have multi-language text that contains a message translated to several languages. For example:
English message
Russian message
Ukrainian message
The order is not exact. I would like to devise some kind of supervised or unsupervised learning algorithm to do the segmentation automatically and extract each translation, in order to build a parallel corpus.
Could you suggest any papers or approaches? I cannot come up with the proper keywords for googling.
Upvotes: 1
Views: 85
Reputation: 122052
Why don't you try some language identification software? Such tools typically report > 90% accuracy.
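For this specific three-language case you can get surprisingly far without any trained identifier at all. Here is a minimal stdlib sketch that guesses the language of each line from its Unicode characters; the letters і, ї, є, ґ are Cyrillic but occur in Ukrainian and not Russian, which is the assumption this toy heuristic rests on. A real system should use proper language-identification software instead.

```python
import unicodedata

# Cyrillic letters used in Ukrainian but not in Russian orthography
# (assumption this toy heuristic relies on).
UKRAINIAN_ONLY = set("іїєґІЇЄҐ")

def guess_language(line):
    """Crude per-line guess among English / Russian / Ukrainian."""
    if any(ch in UKRAINIAN_ONLY for ch in line):
        return "uk"
    if any("CYRILLIC" in unicodedata.name(ch, "") for ch in line):
        return "ru"
    if any(ch.isalpha() and ch.isascii() for ch in line):
        return "en"
    return "unknown"

doc = ["hello world", "привет мир", "привіт світ"]
print([guess_language(line) for line in doc])  # ['en', 'ru', 'uk']
```

Once every line carries a language tag, grouping the lines into per-language segments (and hence aligned translations) is straightforward.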
Upvotes: 1
Reputation: 4106
The most basic approach to your problem would be to build a bag-of-words representation of your document. In short, a bag of words is a matrix where each row is a line of your document and each column a distinct term.
For instance, if your document looks like this:
hello world
привет мир
привіт світ
You will get this matrix:
   | hello | world | привет | мир | привіт | світ
l1 |   1   |   1   |   0    |  0  |   0    |  0
l2 |   0   |   0   |   1    |  1  |   0    |  0
l3 |   0   |   0   |   0    |  0  |   1    |  1
You can then apply clustering algorithms (such as k-means, if you have no labels) or classifiers (such as SVMs, if you do) according to your needs.
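The matrix above can be built with the standard library alone; this is a hedged sketch (real pipelines would typically use something like scikit-learn's `CountVectorizer`), and the whitespace tokenizer is a simplifying assumption:

```python
from collections import Counter

doc = ["hello world", "привет мир", "привіт світ"]

# Tokenize each line and collect the vocabulary (the matrix columns).
tokenized = [line.split() for line in doc]
vocab = sorted({term for terms in tokenized for term in terms})

def bag_of_words(lines, vocab):
    """Return one row of term counts per line, columns ordered by vocab."""
    rows = []
    for terms in lines:
        counts = Counter(terms)
        rows.append([counts[term] for term in vocab])
    return rows

matrix = bag_of_words(tokenized, vocab)
for label, row in zip(["l1", "l2", "l3"], matrix):
    print(label, row)
```

Because translations of the same message share almost no surface terms, the rows for different languages end up nearly orthogonal, which is exactly what makes clustering them easy.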
For more details, I would suggest reading this paper, which provides a great summary of techniques.
Regarding keywords for googling, I would say text analysis, text mining, or information retrieval are a good start.
Upvotes: 2