Reputation: 2099
So, I'm running Python 3.3.2, I have a string (sentence, paragraph(s)):
mystring=["walk walked walking talk talking talks talked fly flying"]
And I have another list of words I need to search for in that string:
list_of_words=["walk","talk","fly"]
And my question is: is there a way to get the count of each of those words as a result?
Bottom line, is it possible to get a count of all possible variations of a word?
Upvotes: 2
Views: 6113
Reputation: 2099
import spacy
nlp = spacy.load("en_core_web_sm")  # the model must first be downloaded, e.g. python -m spacy download en_core_web_sm (see https://spacy.io/usage/models)
mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]
doc = nlp(mystring)
verb_lemmas_in_list_of_words = [token.lemma_ for token in doc if token.pos_ == 'VERB' and token.lemma_ in list_of_words]
verb_lemmas_in_list_of_words
['walk', 'walk', 'talk', 'talk', 'fly']
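If you want per-word counts rather than the raw lemma list, a minimal follow-up sketch (reusing verb_lemmas_in_list_of_words from above) could tally the lemmas with collections.Counter:
from collections import Counter
# Tally how often each lemma from list_of_words occurs in the text
counts = Counter(verb_lemmas_in_list_of_words)
counts
Counter({'walk': 2, 'talk': 2, 'fly': 1})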
Upvotes: 0
Reputation: 27585
from difflib import get_close_matches

mystring = "walk walked walking talk talking talks talked fly flying"
list_of_words = ["walk", "talk", "fly"]

sp = mystring.split()
for x in list_of_words:
    # get_close_matches returns at most n=3 candidates by default, ranked by similarity;
    # keep only those that actually contain the target word
    li = [y for y in get_close_matches(x, sp, cutoff=0.5) if x in y]
    print('%-7s %d in %-10s' % (x, len(li), li))
Result:
walk 2 in ['walk', 'walked']
talk 3 in ['talk', 'talks', 'talked']
fly 2 in ['fly', 'flying']
The cutoff refers to the same ratio as computed by SequenceMatcher:
from difflib import SequenceMatcher

sq = SequenceMatcher(None)
for x in list_of_words:
    for w in sp:
        sq.set_seqs(x, w)
        print('%-7s %-10s %f' % (x, w, sq.ratio()))
Result:
walk walk 1.000000
walk walked 0.800000
walk walking 0.727273
walk talk 0.750000
walk talking 0.545455
walk talks 0.666667
walk talked 0.600000
walk fly 0.285714
walk flying 0.200000
talk walk 0.750000
talk walked 0.600000
talk walking 0.545455
talk talk 1.000000
talk talking 0.727273
talk talks 0.888889
talk talked 0.800000
talk fly 0.285714
talk flying 0.200000
fly walk 0.285714
fly walked 0.222222
fly walking 0.200000
fly talk 0.285714
fly talking 0.200000
fly talks 0.250000
fly talked 0.222222
fly fly 1.000000
fly flying 0.666667
Upvotes: 2
Reputation: 31
I know this is an old question, but I feel that this discussion wouldn't be complete without mentioning the NLTK library, which provides a ton of Natural Language Processing tools, including one that can perform this task pretty easily.
Essentially, you want to compare the uninflected words in the target list to the uninflected forms of the words in mystring. There are two common ways of removing inflections (e.g. -ing, -ed, -s): stemming and lemmatizing. In English, lemmatizing, which reduces a word to its dictionary form, is usually better, but for this task I think stemming is right. Stemming is usually faster anyway.
mystring="walk walked walking talk talking talks talked fly flying"
list_of_words=["walk","talk","fly"]
word_counts = {}
from nltk.stem.snowball import EnglishStemmer
stemmer = EnglishStemmer()
for target in list_of_words:
word_counts[target] = 0
for word in mystring.split(' '):
# Stem the word and compare it to the stem of the target
stem = stemmer.stem(word)
if stem == stemmer.stem(target):
word_counts[target] += 1
print word_counts
Output:
{'fly': 2, 'talk': 4, 'walk': 3}
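For comparison, here is a minimal sketch of the lemmatizing route mentioned above. It assumes NLTK's WordNet data has been downloaded (nltk.download('wordnet')) and reuses mystring from the snippet above; passing pos='v' tells the lemmatizer to treat each token as a verb:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Reduce each token to its dictionary (verb) form
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in mystring.split()]
print(lemmas)
The output should be:
['walk', 'walk', 'walk', 'talk', 'talk', 'talk', 'talk', 'fly', 'fly']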
Upvotes: 3
Reputation: 64258
One method might be to split the string by spaces, then look for all the words that contain the particular word you want to find a variation for.
For example:
def num_variations(word, sentence):
    # Count the whitespace-separated tokens that contain the target word as a substring
    return sum(1 for snippet in sentence.split(' ') if word in snippet)

for word in ["walk", "talk", "fly"]:
    print(word, num_variations(word, "walk walked walking talk talking talks talked fly flying"))
However, this method is somewhat naive and wouldn't understand English morphology. For example, using this method, "fly" would not match "flies".
In that case, you might need to use some sort of natural language library that comes equipped with a decent dictionary to catch these edge cases.
You may find this answer useful. It accomplishes something similar by using the NLTK library to find the stem of the word (removing plurals, irregular spellings, etc.), then summing them up using a method similar to the one above. It may be overkill for your case though, depending on precisely what you're trying to accomplish.
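As a rough illustration, a minimal sketch of that stemming approach (assuming NLTK is installed), which makes the naive function above catch forms like "flies":
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

def num_variations(word, sentence):
    # Compare stems instead of raw substrings, so 'flies' is counted as a form of 'fly'
    target_stem = stemmer.stem(word)
    return sum(1 for token in sentence.split() if stemmer.stem(token) == target_stem)

print(num_variations("fly", "fly flying flies"))  # should print 3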
Upvotes: 2