Reputation: 240
I have a sample dataset.
import pandas as pd

raw_data = {
    'categories': ['sweet beverage', 'salty snacks', 'beverage,sweet', 'fruit juice,beverage,', 'salty crackers'],
    'product_name': ['coca-cola', 'salted pistachios', 'fruit juice', 'lemon tea', 'roasted peanuts']}
df_a = pd.DataFrame(raw_data)
I need to iterate through the rows of the 'categories' column and check whether each value contains a particular string, in this case 'beverage'; if it does, I want to update the category to just 'beverage'. The link below is the closest I found on Stack Overflow, but it doesn't tell me how to go through the whole dataset.
Replace whole string if it contains substring in pandas
Here's my sample code.
for index, row in df.iterrows():
    if row.str.contains('beverage', na=False):
        df.loc[index, 'categories_en'] = 'Beverages'
    elif row.str.contains('salty', na=False):
        df.loc[index, 'categories_en'] = 'Salty Snack'
    ....<and other conditions>
How can I achieve this? Thanks all!
Upvotes: 3
Views: 8716
Reputation: 11
Use the __contains__() method of Python's string class:
for a in df_a["categories"]:
    if a.__contains__("beverage"):
        df_a["categories"].replace(a, "beverage", inplace=True)
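For comparison, here is a minimal sketch of the same idea written with the `in` operator, which is the usual way to invoke `__contains__`. It rebuilds the column in one pass instead of replacing values while looping. The data is the sample from the question; this is one possible way to write it, not the only one.

```python
import pandas as pd

df_a = pd.DataFrame({
    "categories": ["sweet beverage", "salty snacks", "beverage,sweet",
                   "fruit juice,beverage,", "salty crackers"],
})

# `"beverage" in a` calls a.__contains__("beverage") under the hood.
# Values without the substring are kept unchanged.
df_a["categories"] = [
    "beverage" if "beverage" in a else a
    for a in df_a["categories"]
]

print(df_a["categories"].tolist())
```

This assumes every value in the column is a string; missing values would need a separate check before the `in` test.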
Upvotes: 1
Reputation: 240
Thanks for all the various solutions to my question. Based on all your inputs, I have come up with this solution, which works.
def transformCat(df):
df.loc[df.categories_en.str.lower().str.contains('beers|largers|wines|rotwein|biere',na=False)] = 'Alcoholic,Beverages'
df.loc[df.categories_en.str.lower().str.contains('cheese',na=False)] = 'Dairies,Cheeses'
df.loc[df.categories_en.str.lower().str.contains('yogurts',na=False)] = 'Dairies,Yogurts'
df.loc[df.categories_en.str.lower().str.contains(r'sauce.*ketchup|ketchup.*sauce',na=False)] = 'Sauces,Ketchups'
Would appreciate any inputs. Thanks all!
PS - I am aware there should be an indent beginning at df.loc, but since I am new to Stack Overflow (I will learn, I promise), somehow I can't get the indentation correct.
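One thing worth checking in the function above: `df.loc[mask] = value` with no column label writes the value into every column of the matching rows, not only `categories_en`. Below is a hedged sketch of a variant that names the column explicitly. The `RULES` table, the `transform_cat` name, and the sample rows are hypothetical and only illustrate the pattern.

```python
import pandas as pd

# Hypothetical rule table: (regex pattern, replacement label), applied in order.
RULES = [
    (r"beers|wines", "Alcoholic,Beverages"),
    (r"cheese", "Dairies,Cheeses"),
    (r"yogurts", "Dairies,Yogurts"),
]

def transform_cat(df, column="categories_en"):
    # Match against a lowercased snapshot of the original values so that
    # labels written by earlier rules cannot trigger later patterns.
    source = df[column].str.lower()
    for pattern, label in RULES:
        mask = source.str.contains(pattern, na=False)
        # Naming the column limits the write to that one column.
        # If a row matches several patterns, the last matching rule wins.
        df.loc[mask, column] = label
    return df

# Made-up rows for illustration only.
df = pd.DataFrame({
    "categories_en": ["Craft Beers", "Goat cheese", "Greek yogurts", None],
    "product_name": ["ipa", "chevre", "plain", "unknown"],
})
transform_cat(df)
print(df)
```

With the column named in `.loc`, `product_name` is left untouched, and rows with a missing category are skipped because of `na=False`.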
Upvotes: 0
Reputation: 38415
You can use
df_a.loc[df_a.categories.str.contains('beverage'), 'categories'] = 'beverage'
       categories       product_name
0        beverage          coca-cola
1    salty snacks  salted pistachios
2        beverage        fruit juice
3        beverage          lemon tea
4  salty crackers    roasted peanuts
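The question also asks about several if/elif-style conditions. A possible extension of the boolean-mask approach is `numpy.select`, which takes a list of masks and a list of labels and uses the first mask that is true for each row. This is a sketch under the assumption that the labels from the question ('Beverages', 'Salty Snack') are the desired output and that unmatched rows should keep their original value.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({
    "categories": ["sweet beverage", "salty snacks", "beverage,sweet",
                   "fruit juice,beverage,", "salty crackers"],
    "product_name": ["coca-cola", "salted pistachios", "fruit juice",
                     "lemon tea", "roasted peanuts"],
})

cats = df_a["categories"]

# One boolean mask per rule; order matters, like if/elif.
conditions = [
    cats.str.contains("beverage", case=False, na=False),
    cats.str.contains("salty", case=False, na=False),
]
labels = ["Beverages", "Salty Snack"]

# Rows matching no condition fall back to their original category.
df_a["categories_en"] = np.select(
    conditions, labels, default=cats.to_numpy(dtype=object)
)

print(df_a)
```

Writing to a new `categories_en` column keeps the original `categories` values available for checking the result.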
Upvotes: 1
Reputation: 1022
Use apply to generate a new categories column, then assign it to the categories_en column of the dataframe.
def map_categories(cat: str) -> str:
    if cat.find("beverage") != -1:
        return "beverage"
    else:
        # Keep the original category when there is no match
        return cat

new_col = df['categories'].apply(map_categories)
df['categories_en'] = new_col
Upvotes: 0
Reputation: 323226
Create the following dicts, then use replace:
Yourdict2={1:'Beverages',2:'salty'}
Yourdict1={'beverage':1,'salty':2}
df_a.categories.replace(Yourdict1,regex=True).map(Yourdict2)
Out[275]:
0    Beverages
1        salty
2    Beverages
3    Beverages
4        salty
Name: categories, dtype: object
Upvotes: 3
Reputation: 199
Maybe you can try something like this:
def selector(x):
    if 'beverage' in x:
        return 'Beverages'
    if 'salty' in x:
        return 'Salty snack'

df_a['categories_en'] = df_a['categories'].apply(selector)
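Note that a function like this returns None for any category that matches neither keyword, and `'beverage' in x` raises a TypeError if x is a missing value. A minimal sketch of a more defensive variant follows. The `LABELS` table and the extra sample rows are hypothetical, and keeping the original value for unmatched rows is an assumption about the desired behaviour.

```python
import pandas as pd

# Hypothetical keyword -> label table; the first matching keyword wins.
LABELS = {"beverage": "Beverages", "salty": "Salty snack"}

def selector(x):
    if not isinstance(x, str):
        return x  # leave missing values (None/NaN) untouched
    for keyword, label in LABELS.items():
        if keyword in x:
            return label
    return x  # no keyword matched: keep the original category

# Made-up rows, including an unmatched category and a missing value.
df_a = pd.DataFrame(
    {"categories": ["sweet beverage", "salty snacks", "fresh fruit", None]}
)
df_a["categories_en"] = df_a["categories"].apply(selector)

print(df_a)
```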
Upvotes: 0