Reputation: 747
I have a text corpus of item descriptions in English, Russian, and Polish.
The corpus has 68K observations; some are written in English, some in Russian, and some in Polish.
How can I properly and cost-efficiently implement word stemming in this case? I cannot use an English stemmer on Russian words and vice versa.
Unfortunately, I could not find a good language identifier. For example, langdetect
is too slow and often incorrect. When I try to identify the language of the English word 'today', I get:
detect("today")
"so"
# i.e. Somali
So far my code implementation looks bad: I just run one stemmer on top of another:
import nltk
# Polish stemmer
from pymorfologik import Morfologik

clean_items = []

# create stemmers
snowball_en = nltk.SnowballStemmer("english")
snowball_ru = nltk.SnowballStemmer("russian")
stemmer_pl = Morfologik()

# loop over each item; create an index i that goes from 0 to the length
# of the item list, stemming every item with all three stemmers in turn
# and adding the result to the list of clean items
for i in range(num_items):
    cleaned = items.iloc[i]
    clean_items.append(snowball_ru.stem(stemmer_pl(snowball_en.stem(cleaned))))
Upvotes: 1
Views: 2801
Reputation: 198324
Even though its API is not that great, you can make langdetect
restrict itself to only the languages you are actually working with. For example:
from langdetect.detector_factory import DetectorFactory, PROFILES_DIRECTORY
import os

def get_factory_for(langs):
    df = DetectorFactory()
    profiles = []
    for lang in langs:
        with open(os.path.join(PROFILES_DIRECTORY, lang), 'r', encoding='utf-8') as f:
            profiles.append(f.read())
    df.load_json_profile(profiles)

    def _detect_langs(text):
        d = df.create()
        d.append(text)
        return d.get_probabilities()

    def _detect(text):
        d = df.create()
        d.append(text)
        return d.detect()

    df.detect_langs = _detect_langs
    df.detect = _detect
    return df
While unrestricted langdetect
seems to think "today"
is Somali, if you only have English, Russian, and Polish, you can now do this:
df = get_factory_for(['en', 'ru', 'pl'])
df.detect('today') # 'en'
df.detect_langs('today') # [en:0.9999988994459187]
It will still misclassify some words ("snow"
is apparently Polish), but it should drastically cut down on your error rate.
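Once detection is restricted, you can route each item to a single language-appropriate stemmer instead of chaining all three. Here is a minimal sketch of that dispatch pattern; the `detect` callable and the per-language stemmers are assumptions, so plug in the restricted factory's `detect` and your real stemmers (SnowballStemmer for en/ru, Morfologik for pl):

```python
# Sketch: dispatch each text to the stemmer for its detected language.
# `detect` and the entries of `stemmers` are stand-ins -- replace them
# with the restricted langdetect factory and your real stemmers.

def make_router(detect, stemmers, fallback="en"):
    """Return a function that stems a text word-by-word with the stemmer
    for its detected language, falling back when detection fails or the
    detected language is not covered."""
    def stem_item(text):
        try:
            lang = detect(text)
        except Exception:  # langdetect raises on empty/featureless input
            lang = fallback
        stem = stemmers.get(lang, stemmers[fallback])
        return " ".join(stem(word) for word in text.split())
    return stem_item

# Illustration only: toy stemmers and a dummy detector.
stemmers = {
    "en": lambda w: w.rstrip("s"),  # replace with snowball_en.stem
    "ru": lambda w: w,              # replace with snowball_ru.stem
    "pl": lambda w: w,              # replace with a Morfologik-based stem
}
router = make_router(lambda text: "en", stemmers)
print(router("cats dogs"))  # cat dog
```

This way each observation is stemmed exactly once, with the correct stemmer, rather than being pushed through an English, a Polish, and a Russian stemmer in sequence.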
Upvotes: 2