Reputation: 97

Removing multiple substrings in a pandas dataframe column

I have a column of ingredients in a pandas dataframe. I need to remove everything except the name of the ingredient (e.g. 1/3 cup cashews -> cashews).

Input

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    ⅓ cup cashews
1   Truvani Chocolate Turmeric Caramel Cups    4 dates
2   Truvani Chocolate Turmeric Caramel Cups    1 tablespoon almond butter
3   Truvani Chocolate Turmeric Caramel Cups    3 tablespoons coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    ½ teaspoon vanilla extract

Expected Output

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    cashews
1   Truvani Chocolate Turmeric Caramel Cups    dates
2   Truvani Chocolate Turmeric Caramel Cups    almond butter
3   Truvani Chocolate Turmeric Caramel Cups    coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    vanilla extract

I've tried using a dictionary, with common words mapped to empty strings like so:

remove_list ={'\d+': '', 'ounces': '', 'ounce': '', 'tablespoons': '', 'tablespoon': '', 'teaspoons': '', 'teaspoon': '', 'cup': '', 'cups': ''}
column = df['ingredient']
column.apply(lambda column: [remove_list[y] if y in remove_list else y for y in column])

This didn't change the data at all.

I've also tried using regex:

df['ingredients'] = re.sub(r'|'.join(map(re.escape, remove_list)), '', df['ingredients'])

But that just gives an error saying "TypeError: expected string or buffer."

I'm very new to Python. I think it's possible with regex, but I'm not sure how to do it.

Upvotes: 6

Answers (3)

ALollz

Reputation: 59519

Since you want to replace every pattern with the same string (the empty string), you can put the patterns into a list instead of a dictionary.

# 'cups' is listed before 'cup' so the longer alternative is tried first
l = [r'\d+', r'[^\x00-\x80]+', 'ounces', 'ounce', 'tablespoons',
     'tablespoon', 'teaspoons', 'teaspoon', 'cups', 'cup']

Then join the patterns with '|' and do a single replace.

df.ingredient.str.replace('|'.join(l), '', regex=True).str.strip()
# Safer: only replace tokens that are followed by whitespace. strip is then not needed
#df.ingredient.str.replace('|'.join([x + r'\s' for x in l]), '', regex=True)

Output:

0            cashews
1              dates
2      almond butter
3       coconut milk
4    vanilla extract
Name: ingredient, dtype: object

I added '[^\x00-\x80]+' to the list to remove the fraction characters (such as ⅓ and ½), and .str.strip() removes any leading or trailing whitespace left after the replacements.
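
To see what the joined pattern does outside pandas, here is a minimal sketch using only the standard re module on plain strings. The sample strings, the clean helper, and the use of \b word boundaries with [^\x00-\x7F] (strictly non-ASCII) are assumptions for illustration, not part of the answer above.

```python
import re

# Unit words to strip; mirrors the list above and is not exhaustive.
units = ['ounces', 'ounce', 'tablespoons', 'tablespoon',
         'teaspoons', 'teaspoon', 'cups', 'cup']

# Longest-first ordering matters when no word boundary is used;
# with \b...\b it is just a harmless safeguard.
unit_pattern = r'\b(?:' + '|'.join(sorted(units, key=len, reverse=True)) + r')\b'

# Digits and non-ASCII characters (such as the fraction glyphs) are
# removed wherever they occur; unit words only as whole words.
pattern = re.compile(r'\d+|[^\x00-\x7F]+|' + unit_pattern)

def clean(text):
    """Remove quantities and unit words, then collapse leftover spaces."""
    return ' '.join(pattern.sub('', text).split())

samples = ['⅓ cup cashews', '2 cups cupcake flour', '1 tablespoon almond butter']
cleaned = [clean(s) for s in samples]
print(cleaned)
```

The word boundaries keep 'cupcake' intact, which a bare 'cup' alternative would not. The same pattern string should be usable with df.ingredient.str.replace(..., regex=True).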

Upvotes: 8

schaefferda

Reputation: 348

You can loop over the column and use .split() to drop the first token (the quantity):

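# Iterate over (index label, value) pairs so each write targets the right row
for idx, row in df['ingredient'].items():
    item = row.split(sep=' ', maxsplit=1)
    # Single .loc call: chained df['ingredient'].loc[i] = ... may not update df
    df.loc[idx, 'ingredient'] = item[-1]  # item[-1] also copes with rows that have no space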

The output will be:

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    cup cashews
1   Truvani Chocolate Turmeric Caramel Cups    dates
2   Truvani Chocolate Turmeric Caramel Cups    tablespoon almond butter
3   Truvani Chocolate Turmeric Caramel Cups    tablespoons coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    teaspoon vanilla extract

If you want to keep the measurements, you can duplicate the column, keeping the quantities in one column and the ingredient names in the other.
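
A rough sketch of that two-column idea, assuming a single split on the first space is enough. The column names quantity and item and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows taken from the question's input.
df = pd.DataFrame({'ingredient': ['⅓ cup cashews', '4 dates',
                                  '1 tablespoon almond butter']})

# Split once on the first space: column 0 is the leading quantity,
# column 1 is everything after it (unit words are still included).
parts = df['ingredient'].str.split(' ', n=1, expand=True)
df['quantity'] = parts[0]
df['item'] = parts[1]
print(df)
```

This avoids the explicit loop, but like the loop it only separates the first token, so unit words such as "cup" stay in the item column.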

Upvotes: -1

CJR

Reputation: 3985

pandas has a set of vectorized string methods for this, available through the .str accessor on a Series.

Something like this should work:

df['ingredient'] = df['ingredient'].str.replace(r'\d+', '', regex=True)

I don't know whether you can pass a dict to it; you may have to iterate through your dictionary to apply all of the replacements that you want.

for ptn, rpl in remove_list.items():
    df['ingredient'] = df['ingredient'].str.replace(ptn, rpl, regex=True)
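
On the dict question, one option that does take a dict is Series.replace (without .str) with regex=True. The sketch below assumes a small sample Series and a remove_map dict modeled on the question's remove_list; both are illustrative.

```python
import pandas as pd

# Hypothetical sample data; remove_map mirrors the question's remove_list.
s = pd.Series(['3 tablespoons coconut milk', '4 dates', '2 cups oats'])
remove_map = {r'\d+': '', 'tablespoons': '', 'tablespoon': '',
              'cups': '', 'cup': ''}

# With regex=True, each key is treated as a regular expression and
# matching substrings are replaced by the corresponding value.
cleaned = s.replace(remove_map, regex=True).str.strip()
print(cleaned.tolist())
```

Listing plural forms before singular ones ('cups' before 'cup') avoids leaving a stray 's' behind.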

Upvotes: 0
