Reputation: 474
I am using the Apache OpenNLP library for a project that needs several NLP tasks performed in different languages, and Russian is one of the most important. However, I do not know Russian and cannot find any OpenNLP models for it.
So the only way I can reliably perform sentence detection is to train a sentence detector on Russian text and produce a model to use later. The text I have to analyze is very domain-specific and not general enough to train a valid model.
Therefore I am asking whether anyone can provide a Russian reference text, already divided into sentences, that is general enough (contains common idioms, abbreviations, etc.). I don't know how long it should be, since the documentation doesn't specify a suggested size for training texts, but I think a few hundred sentences might be enough.
Upvotes: 2
Views: 1324
Reputation: 474
In the end I took the document suggested in the first comment, plus some articles from Wikipedia, and achieved 98% precision, so it's fine :3
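For anyone wondering what a precision figure like that means for sentence detection: it is the share of predicted sentence boundaries that are also boundaries in a hand-checked reference. The post does not say how the 98% was measured, so the following is only a minimal standalone sketch of the usual definition. The class name `BoundaryPrecision`, the method `precision`, and the sample offsets are all made up for illustration and are not part of OpenNLP.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not an OpenNLP class: computes precision over
// sentence-boundary character offsets.
final class BoundaryPrecision {

    // precision = (predicted boundaries that are also in gold) / (all predicted boundaries)
    static double precision(Set<Integer> gold, Set<Integer> predicted) {
        if (predicted.isEmpty()) {
            return 0.0; // nothing predicted; report 0 instead of dividing by zero
        }
        Set<Integer> correct = new HashSet<>(predicted);
        correct.retainAll(gold); // keep only boundaries confirmed by the reference
        return (double) correct.size() / predicted.size();
    }

    public static void main(String[] args) {
        // Invented offsets: the detector proposes 4 boundaries, 3 match the reference.
        Set<Integer> gold = Set.of(10, 25, 40);
        Set<Integer> predicted = Set.of(10, 25, 33, 40);
        System.out.println(precision(gold, predicted)); // prints 0.75
    }
}
```

Precision alone does not show how many true boundaries were missed, so it is worth checking recall on the same held-out text as well.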
Upvotes: 1