Isura Nirmal

Reputation: 787

Singular and Plural words matching with Pandas

This question is an extension of my previous question, Multiple Phrases Matching Python Pandas. An answer there solved my original problem, but a new problem with singular and plural words has appeared.

ingredients=pd.Series(["vanilla extract","walnut","oat","egg","almond","strawberry"])

df=pd.DataFrame(["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

What I needed was to match the phrases in the ingredients Series against the phrases in the DataFrame. In pseudocode:

If an ingredient (singular or plural) is found in a phrase in the DataFrame, return the ingredient. Otherwise, return false.

This was achieved by an answer given as follows:

df.columns = ['val']
V = df.val.str.lower().values.astype(str)
K = ingredients.values.astype(str)
df['existence'] = list(map(''.join, np.where(np.char.count(V, K[...,np.newaxis]),K[...,np.newaxis], '').T))

I also applied the following to replace the empty cells with NaN so that I can easily filter the data.

df.loc[df.existence=='', 'existence'] = np.nan

The results were as follows:

print(df)
                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...             NaN    
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN  
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries              NaN

This works as long as the plural is formed by simply appending s, as in almond => almonds or apple => apples, because the singular is then a substring of the plural. But when the plural changes the ending, as in strawberry => strawberries, the code returns NaN.
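To make the failure concrete, here is a small sketch in plain Python. It uses the `in` operator as a stand-in for the `np.char.count(...)` test, and the names `phrases` and `found` are only illustrative:

```python
# Each ingredient paired with the phrase it should be found in.
phrases = {
    "almond": "6 ounces smoke-flavored almonds, finely chopped",
    "strawberry": "2 small strawberries",
}

# Substring test: succeeds only if the singular appears verbatim in the phrase.
found = {ingredient: ingredient in phrase for ingredient, phrase in phrases.items()}
print(found)  # {'almond': True, 'strawberry': False}
```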

To detect such cases, I would like to change my ingredients Series into a DataFrame as follows.

#ingredients

#inputwords       #outputword

vanilla extract    vanilla extract 
walnut             walnut
walnuts            walnut
oat                oat
oats               oat
egg                egg
eggs               egg
almond             almond
almonds            almond
strawberry         strawberry
strawberries       strawberry
cherry             cherry
cherries           cherry

So my logic is: whenever a word in #inputwords appears in the phrase, I want to return the word in the adjacent #outputword cell. In other words, when strawberry or strawberries appears in the phrase, the code should output strawberry. My final result would then be:

                                                 val        existence
0                        1 teaspoons vanilla extract  vanilla extract
1                                             2 eggs              egg
2                             3 cups chopped walnuts           walnut
3                                 4 cups rolled oats              oat
4  1 (10.75 ounce) can Campbell's Condensed Cream...             NaN    
5    6 ounces smoke-flavored almonds, finely chopped           almond
6                                          sdfgsfgsf              NaN  
7                                        fsfgsgsfgfg              NaN
8                               2 small strawberries       strawberry

I cannot find a way to incorporate this functionality into the existing code, or to write new code that does it. Can anyone help me with this?

Upvotes: 1

Views: 1652

Answers (2)

Nader Hisham

Reputation: 5414

import numpy as np
import pandas as pd

# your data frame
df = pd.DataFrame(data = ["1 teaspoons vanilla extract","2 eggs","3 cups chopped walnuts","4 cups rolled oats","1 (10.75 ounce) can Campbell's Condensed Cream of Chicken with Herbs Soup","6 ounces smoke-flavored almonds, finely chopped","sdfgsfgsf","fsfgsgsfgfg","2 small strawberries"])

# create the mapping: the index holds every form to look for (singular and plural),
# the data holds the ingredient name to return for that form
mapping = pd.Series(index = ['vanilla extract' , 'walnut','walnuts','oat','oats','egg','eggs','almond','almonds','strawberry','strawberries','cherry','cherries'] , 
          data = ['vanilla extract' , 'walnut','walnut','oat','oat','egg','egg','almond','almond','strawberry','strawberry','cherry','cherry'])

# return the mapped ingredient if any key occurs in the row's phrase, otherwise NaN
# (Series.iterkv() was removed from pandas; items() is the current equivalent)
def get_match(row):
    match = np.nan
    for key , value in mapping.items():
        if key in row[0]:
            match = value
    return match

# apply this function to each row
df.apply(get_match, axis = 1)
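One caveat with a plain `in` test is that it matches substrings anywhere in the phrase, so a key like `oat` would also hit `coat`. Below is a minimal sketch of a word-boundary variant built on the same mapping idea. The `val` and `existence` column names come from the question. The `keys`, `pattern` and `found` names, the trimmed mapping, and the "1 winter coat" row are assumptions added for illustration:

```python
import re

import pandas as pd

df = pd.DataFrame({"val": [
    "1 teaspoons vanilla extract",
    "2 eggs",
    "4 cups rolled oats",
    "1 winter coat",          # made-up row: must NOT match "oat"
    "2 small strawberries",
]})

# Surface form (singular or plural) -> ingredient name to return.
mapping = {
    "vanilla extract": "vanilla extract",
    "egg": "egg", "eggs": "egg",
    "oat": "oat", "oats": "oat",
    "strawberry": "strawberry", "strawberries": "strawberry",
}

# Longest keys first so longer forms are tried before shorter ones;
# \b on both sides restricts matches to whole words.
keys = sorted(mapping, key=len, reverse=True)
pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"

# str.extract returns the first matching form per row (NaN if none);
# map then converts that form to its ingredient name.
found = df["val"].str.lower().str.extract(pattern, expand=False)
df["existence"] = found.map(mapping)
print(df)
```

This replaces the Python-level loop with a single vectorised regex pass, which may matter if the mapping or the DataFrame is large.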

Upvotes: 0

OmerBA

Reputation: 842

Consider using a stemmer :) http://www.nltk.org/howto/stem.html

Taken straight from their page:

    >>> from nltk.stem.snowball import SnowballStemmer
    >>> stemmer = SnowballStemmer("english")
    >>> stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
    >>> print(stemmer.stem("having"))
    have
    >>> print(stemmer2.stem("having"))
    having

Refactor your code to stem all words in the sentence before matching them with the ingredients list.
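A rough sketch of what that refactoring could look like, assuming nltk is installed. The helper names `stem_phrase` and `find_ingredient` are made up for this example, and the ingredient list and phrases are taken from the question. Both the phrase and each ingredient are stemmed, so the comparison happens on stems rather than on the raw words:

```python
import re

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def stem_phrase(text):
    """Lowercase the text, keep alphabetic tokens only, and stem each token."""
    return [stemmer.stem(token) for token in re.findall(r"[a-z]+", text.lower())]


def find_ingredient(phrase, ingredients):
    """Return the first ingredient whose stemmed tokens appear consecutively
    in the stemmed phrase, or None if no ingredient matches."""
    words = stem_phrase(phrase)
    for ingredient in ingredients:
        target = stem_phrase(ingredient)
        n = len(target)
        if any(words[i:i + n] == target for i in range(len(words) - n + 1)):
            return ingredient
    return None


ingredients = ["vanilla extract", "walnut", "oat", "egg", "almond", "strawberry"]
phrases = [
    "1 teaspoons vanilla extract",
    "6 ounces smoke-flavored almonds, finely chopped",
    "sdfgsfgsf",
    "2 small strawberries",
]
for phrase in phrases:
    print(phrase, "->", find_ingredient(phrase, ingredients))
```

With pandas this could be applied per row, for example `df['val'].apply(lambda p: find_ingredient(p, ingredients))`, and the `None` results then show up as missing values.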

nltk is an awesome tool for exactly what you're asking!

Cheers

Upvotes: 2
