Ivan Stankov

Reputation: 21

wordninja does not work with other languages

I have a problem that I can't solve alone. I am currently building an NLP preprocessing pipeline and thought about using wordninja with Cyrillic languages (Russian and Ukrainian). I have set up the dictionaries as described and everything seemed to look all right, but I can't make it work.

import wordninja
wordninja.DEFAULT_LANGUAGE_MODEL = wordninja.LanguageModel('setup/ru_ninja_dict.txt.gz')
wordninja.split("приветпока")

(the output is an empty list [], while ["привет", "пока"] was expected)

My main assumption is that there is an issue with encodings. However, I do not know how to check it myself.
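One way to rule out an encoding problem, as a minimal sketch: read the gzipped word list and decode it explicitly as UTF-8. The helper name `preview_dictionary` is hypothetical and not part of wordninja; it only assumes the dictionary is a gzipped, whitespace-separated text file.

```python
import gzip


def preview_dictionary(path, count=5):
    """Return the first `count` words of a gzipped word list, decoded as UTF-8."""
    with gzip.open(path, "rb") as f:
        raw = f.read()
    # A UnicodeDecodeError raised here would mean the file is not valid UTF-8.
    words = raw.decode("utf-8").split()
    return words[:count]
```

Calling `preview_dictionary('setup/ru_ninja_dict.txt.gz')` and printing the result should show readable Cyrillic words if the encoding is fine.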

Please let me know if you have any ideas!

Upvotes: 2

Views: 391

Answers (1)

Ivan Stankov

Reputation: 21

OK, I figured it out: the problem was not the encoding but the regex pattern that wordninja uses to pre-split the input. The original wordninja code contains

_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")

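which treats every character other than ASCII letters, digits and apostrophes as a separator. A purely Cyrillic string is therefore consumed entirely as a separator and nothing is left to split, so this only works with a limited number of languages (definitely not Cyrillic ones).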

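Replace it with a negated class that also keeps the Cyrillic block (U+0400 to U+04FF), written with \u escapes, since re does not understand the "U+0400" notation:

_SPLIT_RE = re.compile("[^a-zA-Z0-9'\u0400-\u04FF]+")

With this change it works with Russian, Ukrainian and other languages written in Cyrillic.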
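The effect of the pattern can be checked without wordninja at all. The snippet below is a small sketch using only the standard `re` module. The names `ORIGINAL_SPLIT_RE` and `CYRILLIC_SPLIT_RE` are made up for the illustration, and the second pattern is one possible extension rather than an official wordninja setting.

```python
import re

# Pattern from the original wordninja source: anything that is not an ASCII
# letter, digit or apostrophe counts as a separator.
ORIGINAL_SPLIT_RE = re.compile("[^a-zA-Z0-9']+")

# Extended pattern (an assumption for this sketch): characters in the Cyrillic
# block U+0400..U+04FF are also kept instead of being treated as separators.
CYRILLIC_SPLIT_RE = re.compile("[^a-zA-Z0-9'\u0400-\u04FF]+")

text = "приветпока"

# The whole string matches as one separator, leaving only empty chunks.
original_chunks = ORIGINAL_SPLIT_RE.split(text)

# The string survives as a single chunk that can then be segmented.
cyrillic_chunks = CYRILLIC_SPLIT_RE.split(text)

print(original_chunks)  # ['', '']
print(cyrillic_chunks)  # ['приветпока']
```

Assuming `_SPLIT_RE` is looked up as a module-level global when `split` is called, assigning the extended pattern to `wordninja._SPLIT_RE` after the import may be enough, without editing the installed source.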

Upvotes: 0
