Zoozoo

Reputation: 240

Replace whole string if it contains substring in pandas dataframe

I have a sample dataset.

import pandas as pd

raw_data = {
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet', 'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df_a = pd.DataFrame(raw_data)

I need to iterate through the rows of the 'categories' column and check whether each value contains a particular string, in this case 'beverage'. If it does, I want to update the category to just 'beverage'. The link below is the closest I found on Stack Overflow, but it doesn't tell me how to go through the whole dataset.

Replace whole string if it contains substring in pandas

Here's my sample code.

for index,row in df.iterrows():
    if row.str.contains('beverage', na=False):
        df.loc[index,'categories_en'] = 'Beverages' 
    elif row.str.contains('salty',na=False):
        df.loc[index,'categories_en'] = 'Salty Snack'
     ....<and other conditions>

How can I achieve this? Thanks all!

Upvotes: 3

Views: 8716

Answers (6)

Xanyar

Reputation: 11

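Use the __contains__() method of Python's string class:

for a in df_a["categories"]:
    # a.__contains__("beverage") is equivalent to "beverage" in a
    if a.__contains__("beverage"):
        # Replace every cell equal to this value and assign the result back to the column
        df_a["categories"] = df_a["categories"].replace(a, "beverage")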

Upvotes: 1

Zoozoo

Reputation: 240

Thanks for all the various solutions to my question. Based on all your inputs, I have come up with this solution, which works.

def transformCat(df):
    # Lower-case the column before matching; na=False treats missing values as "no match".
    # Passing the column name to .loc updates only categories_en instead of the whole row.
    df.loc[df.categories_en.str.lower().str.contains('beers|largers|wines|rotwein|biere', na=False), 'categories_en'] = 'Alcoholic,Beverages'
    df.loc[df.categories_en.str.lower().str.contains('cheese', na=False), 'categories_en'] = 'Dairies,Cheeses'
    df.loc[df.categories_en.str.lower().str.contains('yogurts', na=False), 'categories_en'] = 'Dairies,Yogurts'
    df.loc[df.categories_en.str.lower().str.contains(r'sauce.*ketchup|ketchup.*sauce', na=False), 'categories_en'] = 'Sauces,Ketchups'

I would appreciate any input. Thanks all!
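
For what it's worth, the same idea can probably be written as a loop over a pattern-to-label table, which keeps the rules in one place. This is only a rough sketch: the `rules` dict and the sample rows are made up for illustration, and `case=False` is used here in place of the `.str.lower()` call.

```python
import pandas as pd

# Hypothetical sample rows, not taken from the real dataset.
df = pd.DataFrame({
    'categories_en': ['Red Wines', 'Goat cheese', 'Tomato sauce, ketchup', None, 'Bread'],
})

# Regex pattern -> replacement label (illustrative subset of the rules).
rules = {
    r'beers|wines': 'Alcoholic,Beverages',
    r'cheese': 'Dairies,Cheeses',
    r'sauce.*ketchup|ketchup.*sauce': 'Sauces,Ketchups',
}

for pattern, label in rules.items():
    # case=False makes the match case-insensitive; na=False skips missing values.
    mask = df['categories_en'].str.contains(pattern, case=False, na=False)
    # Only the categories_en column is overwritten for the matching rows.
    df.loc[mask, 'categories_en'] = label

print(df)
```

Rows that match no pattern (and missing values) are left untouched. Later rules are applied to the already-relabelled column, so the labels themselves should not accidentally match a different pattern.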

Upvotes: 0

Vaishali

Reputation: 38415

You can use

df_a.loc[df_a.categories.str.contains('beverage'), 'categories'] = 'beverage'


    categories      product_name
0   beverage        coca-cola
1   salty snacks    salted pistachios
2   beverage        fruit juice
3   beverage        lemon tea
4   salty crackers  roasted peanuts
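
If there are several substrings to check, as in the question's sample code, one possible extension is to build one mask per substring and let `numpy.select` choose the label. This is a sketch rather than the only way to do it. The labels are taken from the question, and the `conditions` and `labels` names are just illustrative.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet',
                   'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice',
                     'lemon tea', 'roasted peanuts'],
})

# One boolean mask per substring; na=False treats missing values as "no match".
conditions = [
    df_a['categories'].str.contains('beverage', na=False),
    df_a['categories'].str.contains('salty', na=False),
]
labels = ['Beverages', 'Salty Snack']

# np.select uses the label of the first condition that is True for each row
# and falls back to the original value when no condition matches.
df_a['categories_en'] = np.select(conditions, labels, default=df_a['categories'])

print(df_a)
```

The order of `conditions` matters: a row containing both 'beverage' and 'salty' would get the first label.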

Upvotes: 1

Omni

Reputation: 1022

Use apply to generate a new categories column. Then assign it to the categories_en column of the dataframe.

def map_categories(cat: str) -> str:
    if cat.find("beverage") != -1:
        return "beverage"
    else:
        return cat
new_col = df['categories'].apply(map_categories)
df['categories_en'] = new_col

Upvotes: 0

BENY

Reputation: 323226

Create the following dicts, then use replace:

Yourdict2={1:'Beverages',2:'salty'}
Yourdict1={'beverage':1,'salty':2}
df_a.categories.replace(Yourdict1,regex=True).map(Yourdict2)
Out[275]: 
0    Beverages
1        salty
2    Beverages
3    Beverages
4        salty
Name: categories, dtype: object

Upvotes: 3

relay

Reputation: 199

Maybe you can try something like this:

def selector(x):
    if 'beverage' in x:
        return 'Beverages'
    if 'salty' in x:
        return 'Salty snack'

df_a['categories_en'] = df_a['categories'].apply(selector)
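
Note that a function like this returns None for rows that match neither substring, and `'beverage' in x` raises a TypeError if x is a missing value. A possible variant that keeps unmatched and missing values as they are might look like the sketch below. The `RULES` list and the sample Series are assumptions for illustration.

```python
import pandas as pd

# Hypothetical (substring, label) pairs, checked in order.
RULES = [('beverage', 'Beverages'), ('salty', 'Salty snack')]

def selector(value):
    # Missing values (None/NaN) are not strings, so pass them through unchanged.
    if not isinstance(value, str):
        return value
    for substring, label in RULES:
        if substring in value:
            return label
    # No rule matched: keep the original category.
    return value

s = pd.Series(['sweet beverage', 'salty snacks', None, 'fresh fruit'])
result = s.apply(selector)
print(result)
```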

Upvotes: 0
