Reputation: 787
This question is an extension of my previous question, Multiple Phrases Matching Python Pandas. An answer there solved my original problem, but an issue with singular and plural words has since come up.
ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])
df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
What I needed was simply to match the phrases in the ingredients Series against the phrases in the DataFrame. As pseudocode:
If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient; otherwise, return false.
An answer achieved this as follows:
df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),K[...,np.newaxis], '').T))
I also applied the following to fill the empty cells with NaN so that I can easily filter the data.
df.loc[df.existence=='', 'existence'] = np.nan
The results were as follows:
print(df)
val existence
0 1 teaspoons vanilla extract vanilla extract
1 2 eggs egg
2 3 cups chopped walnuts walnut
3 4 cups rolled oats oat
4 1 (10.75 ounce) can Campbell's Condensed Cream... NaN
5 6 ounces smoke-flavored almonds, finely chopped almond
6 sdfgsfgsf NaN
7 fsfgsgsfgfg NaN
8 2 small strawberries NaN
This works as long as the plural is formed by simply appending an s, as in almond => almonds or apple => apples. But when the plural changes the ending of the word, as in strawberry => strawberries, the code returns NaN, because the singular is no longer a substring of the plural.
To make my code detect such occurrences, I would like to change my ingredients Series into a DataFrame as follows.
#ingredients
#inputwords #outputword
vanilla extract vanilla extract
walnut walnut
walnuts walnut
oat oat
oats oat
egg egg
eggs egg
almond almond
almonds almond
strawberry strawberry
strawberries strawberry
cherry cherry
cherries cherry
So my logic here is that whenever a word in #inputwords appears in the phrase, I want to return the word in the neighbouring #outputword cell. In other words, when strawberry or strawberries appears in the phrase, the code should output the word next to it, strawberry. My final result would then be
val existence
0 1 teaspoons vanilla extract vanilla extract
1 2 eggs egg
2 3 cups chopped walnuts walnut
3 4 cups rolled oats oat
4 1 (10.75 ounce) can Campbell's Condensed Cream... NaN
5 6 ounces smoke-flavored almonds, finely chopped almond
6 sdfgsfgsf NaN
7 fsfgsgsfgfg NaN
8 2 small strawberries strawberry
I cannot find a way to incorporate this functionality into the existing code, or to write new code that does it. Can anyone help me with this?
Upvotes: 1
Views: 1652
Reputation: 5414
import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])
# Here you create the mapping: the index holds the words to look for, the data holds the word to return
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] ,
                    data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])
# create a function that checks whether any key of the mapping occurs in a row's phrase
def get_match(row):
    match = np.nan
    # items() replaces iterkv(), which no longer exists in current pandas
    for key , value in mapping.items():
        if key in row[0]:
            match = value
    return match
# apply this function on each row
df.apply(get_match, axis = 1)
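One thing to watch with a plain substring test is that a key such as oat would also match inside an unrelated word like floats. Below is a minimal sketch of the same lookup done with a single regular expression and word boundaries. It uses only the standard library; the names MAPPING, PATTERN and find_ingredient are made up for this example and are not part of pandas.

```python
import re

# Word to look for -> ingredient to return (same pairs as the mapping above).
MAPPING = {
    "vanilla extract": "vanilla extract",
    "walnut": "walnut", "walnuts": "walnut",
    "oat": "oat", "oats": "oat",
    "egg": "egg", "eggs": "egg",
    "almond": "almond", "almonds": "almond",
    "strawberry": "strawberry", "strawberries": "strawberry",
    "cherry": "cherry", "cherries": "cherry",
}

# One alternation over all keys, longest first, wrapped in word boundaries
# so that a key only matches as a whole word.
_keys = sorted(MAPPING, key=len, reverse=True)
PATTERN = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def find_ingredient(phrase):
    """Return the mapped ingredient for the first key found in phrase, else None."""
    found = PATTERN.search(phrase.lower())
    return MAPPING[found.group(1)] if found else None
```

On the data frame above this could presumably be applied with something like df[0].map(find_ingredient), leaving a null value wherever nothing matches.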
Upvotes: 0
Reputation: 842
Consider using a stemmer :) http://www.nltk.org/howto/stem.html
Taken straight from their page:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
>>> print(stemmer.stem("having"))
have
>>> print(stemmer2.stem("having"))
having
Refactor your code to stem all words in the sentence before matching them with the ingredients list.
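To illustrate the "normalize first, then match" idea without needing nltk installed, here is a rough sketch. Note that naive_singular is a hypothetical, deliberately crude stand-in for a real stemmer (it only handles -ies and -s endings), and normalize and match_ingredient are likewise names invented for this example. With a real stemmer the stems are not always dictionary words, so you would want to stem the ingredients list as well as the phrases, which is what the sketch does with its own normalizer.

```python
import re

def naive_singular(word):
    """Crude plural stripper; a stand-in for a real stemmer, not Snowball."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # strawberries -> strawberry
    if word.endswith("s") and not word.endswith("ss") and len(word) > 2:
        return word[:-1]                # eggs -> egg
    return word

def normalize(text):
    """Lowercase, keep alphabetic words only, normalize each, join with spaces."""
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(naive_singular(w) for w in words)

INGREDIENTS = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]

def match_ingredient(phrase):
    """Return the first ingredient whose normalized form appears as whole words."""
    padded = " " + normalize(phrase) + " "
    for ingredient in INGREDIENTS:
        # Padding with spaces makes the substring test respect word boundaries.
        if " " + normalize(ingredient) + " " in padded:
            return ingredient
    return None
```

Swapping naive_singular for stemmer.stem should give the same structure with better coverage of irregular endings.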
nltk is an awesome tool for exactly what you're asking!
Cheers
Upvotes: 2