Reputation: 97
I have a column of ingredients in a pandas dataframe. I need to remove everything except the name of the ingredient (ex: 1/3 cup cashews > cashews).
Input
   recipe_name                              ingredient
0  Truvani Chocolate Turmeric Caramel Cups  ⅓ cup cashews
1  Truvani Chocolate Turmeric Caramel Cups  4 dates
2  Truvani Chocolate Turmeric Caramel Cups  1 tablespoon almond butter
3  Truvani Chocolate Turmeric Caramel Cups  3 tablespoons coconut milk
4  Truvani Chocolate Turmeric Caramel Cups  ½ teaspoon vanilla extract
Expected Output
   recipe_name                              ingredient
0  Truvani Chocolate Turmeric Caramel Cups  cashews
1  Truvani Chocolate Turmeric Caramel Cups  dates
2  Truvani Chocolate Turmeric Caramel Cups  almond butter
3  Truvani Chocolate Turmeric Caramel Cups  coconut milk
4  Truvani Chocolate Turmeric Caramel Cups  vanilla extract
I've tried using a dictionary, with common words mapped to empty strings like so:
remove_list ={'\d+': '', 'ounces': '', 'ounce': '', 'tablespoons': '', 'tablespoon': '', 'teaspoons': '', 'teaspoon': '', 'cup': '', 'cups': ''}
column = df['ingredient']
column.apply(lambda column: [remove_list[y] if y in remove_list else y for y in column])
This didn't change the data at all.
I've also tried using regex:
df['ingredients'] = re.sub(r'|'.join(map(re.escape, remove_list)), '', df['ingredients'])
But that just gives an error saying "TypeError: expected string or buffer."
I'm very new to Python. I think this is possible with regex, but I'm not sure how to do it.
Upvotes: 6
Views: 5249
Reputation: 59519
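Since you want to replace everything with the same string (an empty string), just put the patterns into a list. Put 'cups' before 'cup' so the longer word is matched first and no stray 's' is left behind.
l = [r'\d+', r'[^\x00-\x80]+', 'ounces', 'ounce', 'tablespoons',
     'tablespoon', 'teaspoons', 'teaspoon', 'cups', 'cup']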
Then use a single replace, joining everything with '|'.
df.ingredient.str.replace('|'.join(l), '', regex=True).str.strip()
# Safer: only replace a pattern when it is followed by whitespace. strip is then not needed
#df.ingredient.str.replace('|'.join([x + r'\s' for x in l]), '', regex=True)
0 cashews
1 dates
2 almond butter
3 coconut milk
4 vanilla extract
Name: ingredient, dtype: object
I added '[^\x00-\x80]+' to the list to remove the fraction characters (⅓, ½), and .str.strip() removes any leading or trailing whitespace left after the replacements.
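If partial matches are a concern (for example 'cup' inside 'cupcakes'), word boundaries are one option. This is only a sketch: the units list and the strip_quantities helper are made up for illustration, and it assumes quantities are either digits or single non-ASCII fraction characters as in the sample data.

```python
import re
import pandas as pd

# Hypothetical unit list (singular forms); extend it to match your data.
units = ["ounce", "tablespoon", "teaspoon", "cup"]

# \b...\b matches whole words only; s? also covers the plural form.
unit_pattern = r"\b(?:" + "|".join(map(re.escape, units)) + r")s?\b"
# Digits, or any run of non-ASCII characters such as ⅓ and ½.
quantity_pattern = r"\d+|[^\x00-\x7F]+"


def strip_quantities(series):
    """Remove quantities and units, then tidy up the leftover whitespace."""
    return (
        series
        .str.replace(f"{quantity_pattern}|{unit_pattern}", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)  # collapse runs of spaces
        .str.strip()
    )


df = pd.DataFrame({
    "ingredient": [
        "⅓ cup cashews",
        "4 dates",
        "1 tablespoon almond butter",
        "3 tablespoons coconut milk",
        "½ teaspoon vanilla extract",
        "2 cups cupcakes",  # extra row: 'cupcakes' should survive intact
    ]
})
df["ingredient"] = strip_quantities(df["ingredient"])
print(df)
```

Because the plural is handled by s?, the order of the words in the list no longer matters here.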
Upvotes: 8
Reputation: 348
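You can use a loop and the .split() method:
i = 0
for row in df['ingredient']:
    # Split once on the first space and keep everything after it
    item = row.split(sep=' ', maxsplit=1)
    # Assign with a single .loc call to avoid chained assignment
    df.loc[i, 'ingredient'] = item[1]
    i += 1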
The output will be:
   recipe_name                              ingredient
0  Truvani Chocolate Turmeric Caramel Cups  cup cashews
1  Truvani Chocolate Turmeric Caramel Cups  dates
2  Truvani Chocolate Turmeric Caramel Cups  tablespoon almond butter
3  Truvani Chocolate Turmeric Caramel Cups  tablespoons coconut milk
4  Truvani Chocolate Turmeric Caramel Cups  teaspoon vanilla extract
If you want to keep the measurements, you can create a second column: keep the amounts in one column and the ingredient in the other.
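A vectorized way to get the two columns might look like the following. It is a sketch only: the "amount" column name is made up, and it assumes every row starts with an amount followed by a single space, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "ingredient": ["⅓ cup cashews", "4 dates", "1 tablespoon almond butter"]
})

# Split once on the first space: column 0 is the leading amount,
# column 1 is everything after it.
parts = df["ingredient"].str.split(" ", n=1, expand=True)

df["amount"] = parts[0]      # hypothetical column name
df["ingredient"] = parts[1]  # still contains the unit, e.g. "cup cashews"
print(df)
```

As with the loop, the unit stays attached to the ingredient, so a further replacement step would be needed to drop it.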
Upvotes: -1
Reputation: 3985
There is a set of string methods built into pandas (the .str accessor) for this.
Something like this should work:
df['ingredient'] = df['ingredient'].str.replace(r'\d+', '', regex=True)
I don't know if you can pass a dict to .str.replace; you may have to iterate through your dictionary to apply all of the replacements that you want.
for ptn, rpl in remove_list.items():
    df['ingredient'] = df['ingredient'].str.replace(ptn, rpl, regex=True)
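On the dict question: as far as I know, Series.replace (as opposed to .str.replace) does accept a dict of patterns when regex=True, so the loop may not be necessary. A minimal sketch, with a made-up remove_map in place of the dictionary from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "ingredient": ["4 dates", "2 cups almond flour", "1 tablespoon almond butter"]
})

# Hypothetical patterns. \b and s? keep "cups" from being left as a stray "s",
# which can happen when "cup" is replaced before "cups".
remove_map = {
    r"\d+": "",
    r"\b(?:ounce|tablespoon|teaspoon|cup)s?\b": "",
}

# Series.replace treats the dict keys as regular expressions when regex=True.
df["ingredient"] = df["ingredient"].replace(remove_map, regex=True).str.strip()
print(df["ingredient"].tolist())
```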
Upvotes: 0