rodrigocf

Reputation: 2099

find variations of a word in a string on python

So, I'm running Python 3.3.2, and I have a string (a sentence or paragraph):

mystring="walk walked walking talk talking talks talked fly flying"

And I have a list of words I need to search for in that string:

list_of_words=["walk","talk","fly"]

My question is: is there a way to get a result like:

  1. The word walk or a variation is present 3 times
  2. The word talk or a variation is present 4 times
  3. The word fly or a variation is present 2 times

Bottom line, is it possible to get a count on all possible variations of a word?

Upvotes: 2

Views: 6113

Answers (4)

Saeed

Reputation: 2099

import spacy
# The model must first be downloaded, e.g. `python -m spacy download en_core_web_sm`
# (see https://spacy.io/usage/models)
nlp = spacy.load("en_core_web_sm")

mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]

doc = nlp(mystring)

verb_lemmas_in_list_of_words = [token.lemma_ for token in doc if token.pos_ == 'VERB' and token.lemma_ in list_of_words]
verb_lemmas_in_list_of_words


['walk', 'walk', 'talk', 'talk', 'fly']
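
The question asks for counts rather than a list of lemmas; one way to get them (not part of the original answer) is `collections.Counter` over the lemma list:

```python
from collections import Counter

# Lemmas as produced by the spaCy snippet above (hard-coded here so the
# example runs without the spaCy model installed)
lemmas = ['walk', 'walk', 'talk', 'talk', 'fly']

counts = Counter(lemmas)
print(counts)  # Counter({'walk': 2, 'talk': 2, 'fly': 1})
```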

Upvotes: 0

eyquem

Reputation: 27585

from difflib import get_close_matches
mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]

sp = mystring.split()
for x in list_of_words:
    li = [y for y in get_close_matches(x, sp, n=len(sp), cutoff=0.5) if x in y]
    print('%-7s %d in %s' % (x, len(li), li))

result

walk    3 in ['walk', 'walked', 'walking']
talk    4 in ['talk', 'talks', 'talked', 'talking']
fly     2 in ['fly', 'flying']

The cutoff refers to the same ratio as computed by SequenceMatcher:

from difflib import SequenceMatcher

sq = SequenceMatcher(None)
for x in list_of_words:
    for w in sp:
        sq.set_seqs(x,w)
        print('%-7s %-10s %f' % (x, w, sq.ratio()))

result

walk    walk       1.000000
walk    walked     0.800000
walk    walking    0.727273
walk    talk       0.750000
walk    talking    0.545455
walk    talks      0.666667
walk    talked     0.600000
walk    fly        0.285714
walk    flying     0.200000
talk    walk       0.750000
talk    walked     0.600000
talk    walking    0.545455
talk    talk       1.000000
talk    talking    0.727273
talk    talks      0.888889
talk    talked     0.800000
talk    fly        0.285714
talk    flying     0.200000
fly     walk       0.285714
fly     walked     0.222222
fly     walking    0.200000
fly     talk       0.285714
fly     talking    0.200000
fly     talks      0.250000
fly     talked     0.222222
fly     fly        1.000000
fly     flying     0.666667
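
One caveat with `get_close_matches`: it returns at most n matches (the default is n=3), so with many candidate words some variations can be silently dropped unless a larger n is passed, which is why the loop above uses n=len(sp). A small illustration:

```python
from difflib import get_close_matches

candidates = ['walk', 'walked', 'walking', 'talk']

# With the default n=3, 'talk' (ratio 0.75) outranks 'walking' (ratio ~0.73),
# so 'walking' is dropped from the result
print(get_close_matches('walk', candidates, cutoff=0.5))

# With a larger n, every candidate above the cutoff is returned
print(get_close_matches('walk', candidates, n=10, cutoff=0.5))
```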

Upvotes: 2

Matt S

Reputation: 31

I know this is an old question, but I feel that this discussion wouldn't be complete without mentioning the NLTK library, which provides a ton of Natural Language Processing tools, including one that can perform this task pretty easily.

Essentially, you want to compare the uninflected words in the target list to the uninflected forms of the words in mystring. There are two common ways of removing inflections (e.g. -ing, -ed, -s): stemming or lemmatizing. In English, lemmatizing, which reduces a word to its dictionary form, is usually better, but for this task I think stemming is right. Stemming is usually faster anyway.
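
To make the stemming idea concrete, here is a toy suffix-stripping stemmer (a deliberately simplified illustration, not NLTK's Snowball algorithm):

```python
def toy_stem(word):
    # Strip a few common inflectional suffixes, longest first,
    # keeping at least a 3-letter stem (crude, illustration only)
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([toy_stem(w) for w in 'walk walked walking talks flying'.split()])
# ['walk', 'walk', 'walk', 'talk', 'fly']
```

A real stemmer handles far more cases (doubled consonants, irregular forms, etc.), which is what the NLTK code below relies on.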

mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]

word_counts = {}

from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()

for target in list_of_words:
    word_counts[target] = 0

    for word in mystring.split(' '):

        # Stem the word and compare it to the stem of the target
        stem = stemmer.stem(word)        
        if stem == stemmer.stem(target):
            word_counts[target] += 1

print(word_counts)

Output:

{'fly': 2, 'talk': 4, 'walk': 3}

Upvotes: 3

Michael0x2a

Reputation: 64258

One method might be to split the string by spaces, then look for all the words that contain the particular word you want to find a variation for.

For example:

def num_variations(word, sentence):
    return sum(1 for snippet in sentence.split(' ') if word in snippet)

for word in ["walk", "talk", "fly"]:
    print(word, num_variations(word, "walk walked walking talk talking talks talked fly flying"))

However, this method is somewhat naive and wouldn't understand English morphology. For example, using this method, "fly" would not match "flies".
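
For instance, a quick check of that limitation using the same function:

```python
def num_variations(word, sentence):
    return sum(1 for snippet in sentence.split(' ') if word in snippet)

# "flies" contains no literal "fly" substring, so it is not counted
print(num_variations('fly', 'fly flies flying'))  # prints 2, not 3
```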

In that case, you might need to use some sort of natural language library that comes equipped with a decent dictionary to catch these edge cases.

You may find this answer useful. It accomplishes something similar by using the NLTK library to find the stem of the word (removing plurals, irregular spellings, etc) then summing them up using a method similar to the one above. It may be overkill for your case though, depending on precisely what you're trying to accomplish.

Upvotes: 2
