Mike Zoucha

Reputation: 83

Faster Python Lemmatization

I have been testing different lemmatization methods since they will be used on a very large corpus. Below are my methods and results. Does anyone have any tips to speed any of these methods up? spaCy was the fastest with part-of-speech tags included (preferred), followed by lemminflect. Am I going about this the wrong way? These functions are applied with pandas .apply() on a dataframe containing the text.
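For reference, each function below is called like this (a minimal sketch; the dataframe and column names are just placeholders):

import pandas as pd

df = pd.DataFrame({'text': ["The striped bats were hanging on their feet"]})
df['clean'] = df['text'].apply(prepareString_spacy_pretrained)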

# Shared imports and setup for the snippets below
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def prepareString_nltk_current(x):
    lemmatizer = WordNetLemmatizer()  # note: re-created on every call
    x = re.sub(r"[^0-9a-z]", " ", x)
    if len(x) == 0:
        return ''
    tokens = word_tokenize(x)
    tokens = [lemmatizer.lemmatize(word).strip() for word in tokens if word not in stop_words]
    if len(tokens) == 0:
        return ''
    return ' '.join(tokens)  # tokens are already strings, no str() cast needed

from pattern.en import lemma, parsetree

def prepareString_pattern(x):
    error = 'Error'
    x = re.sub(r"[^0-9a-z.,;]", " ", x)
    if len(x) == 0:
        return ''
    try:
        return " ".join([lemma(wd) if wd not in ['this', 'his'] else wd for wd in x.split()])
    except StopIteration:  # pattern's lazy loading can raise this on first use under newer Python 3 versions
        return error

import spacy
nlp = spacy.load('en_core_web_sm')  # model assumed; shared by the spaCy-based functions below

def prepareString_spacy_pretrained(x):
    if len(x) == 0:
        return ''
    doc = nlp(x)
    # token.lemma is an integer hash; token.lemma_ is the string form
    return re.sub(r"[^0-9a-zA-Z]", " ", " ".join(token.lemma_ for token in doc)).lower()

def get_wordnet_pos(word):
    """Map the POS tag to the first character lemmatize() accepts, then return the lemma."""
    lemmatizer = WordNetLemmatizer()
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    return lemmatizer.lemmatize(word, tag_dict.get(tag, 'n'))

def prepareString_nltk_pos(x):
    if len(x) == 0:
        return ''
    tokens = word_tokenize(x)
    return " ".join(get_wordnet_pos(w) for w in tokens)

from textblob import TextBlob

def prepareString_textblob(x):
    sent = TextBlob(x)
    tag_dict = {"J": 'a',
                "N": 'n',
                "V": 'v',
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
    return " ".join([wd.lemmatize(tag) for wd, tag in words_and_tags])

from gensim.utils import lemmatize  # returns byte strings like b'word/NN'; removed in gensim 4.x

def prepareString_genism(x):
    return " ".join([wd.decode('utf-8').split('/')[0] for wd in lemmatize(x)])

import lemminflect  # importing registers the Token._.lemma() extension on spaCy tokens

def prepareString_leminflect(x):
    doc = nlp(x)
    # lemminflect's extension is a method, token._.lemma(), not an attribute
    return " ".join(token._.lemma() for token in doc)


def prepareString_pattern_pos(x):
    s = parsetree(x, tags=True, lemmata=True)
    # with lemmata=True every Word in the parse tree carries a .lemma attribute
    lemmas = [word.lemma for sentence in s for word in sentence.words]
    return re.sub(r"[^0-9a-zA-Z]", " ", " ".join(lemmas)).lower()

[screenshot of benchmark timings for each method]

Upvotes: 1

Views: 803

Answers (1)

bivouac0

Reputation: 2560

I think it's the spaCy parsing (creating the POS tags, etc.) that takes the time, not the actual lemmatization. From lemminflect's README, that library takes on average 42 µs per lemma (not including parsing). It looks like you're spending more like 39 ms per lemma (i.e. 1044 s / 26,536 lemmas). This means you really need to speed up spaCy's parsing.

  1. Use the smallest spaCy model if you're not already: spacy.load('en_core_web_sm').
  2. Disable the NER and dependency-parser components to speed things up, since you don't need that info; spaCy's load() takes a disable argument for exactly this (see the sketch after this list).
  3. Parallelize the work, which will give you a speed-up almost linear in the number of cores your machine has; nlp.pipe() supports this directly.
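A minimal sketch combining both ideas (the model name, batch size, and process count are assumptions to tune for your machine):

import re
import spacy

# keep the tagger (needed for POS and lemmas) but drop the parser and NER
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize_batch(texts, n_process=4, batch_size=1000):
    out = []
    # nlp.pipe() streams documents in batches across worker processes,
    # which is much faster than calling nlp() once per row via .apply()
    for doc in nlp.pipe(texts, n_process=n_process, batch_size=batch_size):
        out.append(re.sub(r"[^0-9a-zA-Z]", " ",
                          " ".join(tok.lemma_ for tok in doc)).lower())
    return out

# df['clean'] = lemmatize_batch(df['text'].tolist())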

You can also speed up lemminflect a bit by calling getLemma() with the param lemmatize_oov=False. This does dictionary lemma look-up only, which is very fast; it will not lemmatize out-of-vocab words (i.e. misspellings, rare words, ...), which is the much slower path. Note that you'll still have to parse the sentences to get the upos; in spaCy that's token.pos_. See lemminflect's Part-Of-Speech Tags documentation for what it expects.
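Something like this (a sketch; falling back to the surface form for OOV words is my choice, not required):

from lemminflect import getLemma

def fast_lemma(word, upos):
    lemmas = getLemma(word, upos, lemmatize_oov=False)  # empty tuple when out-of-vocab
    return lemmas[0] if lemmas else word  # fall back to the word itself

# with a parsed spaCy doc:
# " ".join(fast_lemma(t.text, t.pos_) for t in doc)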

However, I think your big issue is the parsing, and small changes in the lemmatization speed aren't going to impact you much.

I should also point out that POS tagging only works well if the word appears in a sentence. From your code it looks like you're doing this correctly, but I can't tell for sure. Make sure you are, since the parser can't select the correct POS if you only give it a single word or a small fragment of text.
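A quick illustration of why context matters (the exact output depends on the model, so treat this as indicative):

doc = nlp("I saw the saw.")
print([(t.text, t.pos_, t.lemma_) for t in doc])
# the first 'saw' should come out VERB -> 'see', the second NOUN -> 'saw';
# tagged in isolation there is no way to tell them apart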

Upvotes: 1
