Reputation: 21
I have a question that I can't solve alone. I am currently building an NLP preprocessing pipeline and thought about using wordninja with Cyrillic languages (Russian and Ukrainian). I have set up the dictionaries as described and everything seemed to look alright, but I can't make it work.
import wordninja
wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('setup/ru_ninja_dict.txt.gz')
wordninja.split("приветпока")
(the output is an empty list [], while ["привет", "пока"] was expected)
My main assumption is that there is an issue with encodings. However, I do not know how to check it myself.
Please let me know if you have any ideas!
Upvotes: 2
Views: 391
Reputation: 21
OK, so as I've figured out, the problem was the regex pattern that wordninja uses to split the input. The original wordninja code has
_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")
which treats every character other than a-z, A-Z, 0-9 and the apostrophe as a separator, so it only works with a limited number of languages (definitely not Cyrillic).
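Replace it with
_SPLIT_RE = re.compile("[^a-zA-Z0-9'\u0400-\u04FF]+")
so that characters in the Cyrillic block (U+0400 to U+04FF) are kept as well. With that change it works appropriately with Russian, Ukrainian and other Slavic languages written in Cyrillic.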
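To see the effect without touching wordninja itself, here is a minimal sketch that only uses Python's re module. The first pattern is the one quoted from the wordninja source. The second is my assumption of a reasonable replacement: the same allowed set plus the Cyrillic block. The constant names are made up for this example.

```python
import re

# Pattern quoted from the wordninja source: anything outside
# a-z, A-Z, 0-9 and the apostrophe is treated as a separator.
ORIGINAL_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")

# Assumed replacement (not from wordninja): the same allowed set
# plus the Cyrillic block U+0400-U+04FF.
CYRILLIC_SPLIT_RE = re.compile("[^a-zA-Z0-9'\u0400-\u04FF]+")

text = "приветпока"

# Every Cyrillic character counts as a separator here,
# so the whole input is consumed and only empty strings remain.
original_pieces = ORIGINAL_SPLIT_RE.split(text)

# Cyrillic characters are now allowed, so the input survives as one piece.
cyrillic_pieces = CYRILLIC_SPLIT_RE.split(text)

print(original_pieces)   # ['', '']
print(cyrillic_pieces)   # ['приветпока']
```

With the original pattern there is nothing left for the language model to segment, which would explain the empty list. So it was not an encoding problem in the dictionary after all.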
Upvotes: 0