lemon

Reputation: 747

Stemming of the multilingual text corpus

I have a text corpus with item descriptions in English, Russian and Polish.

This text corpus has 68K observations. Some of these observations are written in English, some in Russian, and some in Polish.

Could you tell me how to properly and cost-efficiently implement word stemming in this case? I cannot use an English stemmer on Russian words and vice versa.

Unfortunately, I could not find a good language identifier. E.g. langdetect is too slow and often incorrect. For example, when I try to identify the language of the English word 'today':

detect("today")
# => "so", i.e. Somali

So far my implementation looks bad: I just run one stemmer after another:

import nltk
# Polish stemmer
from pymorfologik import Morfologik

clean_items = []

# create one stemmer per language
snowball_en = nltk.SnowballStemmer("english")
snowball_ru = nltk.SnowballStemmer("russian")
stemmer_pl = Morfologik()

# loop over every item description and chain all three stemmers --
# which is exactly the problem: each stemmer also mangles words
# of the other two languages
for i in range(len(items)):
    cleaned = items.iloc[i]

    # word-stem with all three stemmers in sequence
    clean_items.append(snowball_ru.stem(stemmer_pl(snowball_en.stem(cleaned))))

Upvotes: 1

Views: 2801

Answers (1)

Amadan

Reputation: 198324

Even though the API is not that great, you can make langdetect restrict itself to only the languages you are actually working with. For example:

from langdetect.detector_factory import DetectorFactory, PROFILES_DIRECTORY
import os

def get_factory_for(langs):
    df = DetectorFactory()
    profiles = []
    for lang in langs:
        with open(os.path.join(PROFILES_DIRECTORY, lang), 'r', encoding='utf-8') as f:
            profiles.append(f.read())
    df.load_json_profile(profiles)

    def _detect_langs(text):
        d = df.create()
        d.append(text)
        return d.get_probabilities()

    def _detect(text):
        d = df.create()
        d.append(text)
        return d.detect()

    df.detect_langs = _detect_langs
    df.detect = _detect
    return df

While unrestricted langdetect seems to think "today" is Somali, if you only have English, Russian and Polish you can now do this:

df = get_factory_for(['en', 'ru', 'pl'])
df.detect('today')         # 'en'
df.detect_langs('today')   # [en:0.9999988994459187]

It will still miss occasionally ("snow" is apparently Polish), but restricting the candidate set drastically cuts down the error rate.
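To tie this back to the stemming task, one possible dispatch (a hypothetical helper, not part of the answer above) is to detect each item's language once and route it to the matching stemmer, instead of chaining all three:

```python
def stem_corpus(items, detect, stemmers):
    """Stem each item with the stemmer for its detected language.

    detect(text) returns a language code such as 'en', 'ru' or 'pl';
    stemmers maps a language code to a per-word stemming function.
    Items in languages with no registered stemmer are left unchanged.
    """
    cleaned = []
    for text in items:
        lang = detect(text)
        stem = stemmers.get(lang, lambda w: w)  # fall back to identity
        cleaned.append(" ".join(stem(w) for w in text.split()))
    return cleaned
```

With the restricted factory above you could call it roughly as `stem_corpus(items, df.detect, {"en": snowball_en.stem, "ru": snowball_ru.stem})`, handling Polish separately since Morfologik's interface differs from NLTK's.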

Upvotes: 2
