Reputation: 65
given this dataframe is it possible to look for particular strings such as the countries that are located inside the countries list? (For example for the first index in 'Country', it has the word Japan inside it and its corresponding value will be 1). Is it possible to sum up the value that corresponds to each country? (End result: Japan: 1+3=4 USA:2 Europe:4)
countries=["Europe","USA","Japan"]
df=pd.DataFrame={'Employees':[1,2,3,4],
'Country':['Japan;Security','USA;Google',"Japan;Sega","Europe;Google"]}
print(df)
Thanks
Upvotes: 2
Views: 55
Reputation: 14949
If you wanna use only those values specified in the country list. You can do something like this -
patt = '(' + '|'.join(countries) + ')'
grp = df.Country.str.extract(pat=patt, expand=False).values
new_df = df.groupby(grp).agg({'Employees': sum})
For example, if the initial country list is missing 'JAPAN' -
countries = ["Europe", "USA"]
patt = '(' + '|'.join(countries) + ')'
grp = df.Country.str.extract(pat=patt, expand=False).values
new_df = df.groupby(grp, dropna=False).agg({'Employees': sum}).reset_index().rename(
columns={'index': 'Country'}).fillna('other')
outptut-
Country Employees
0 Europe 4
1 USA 2
2 other 4 # see the change
Upvotes: 2
Reputation: 93161
Try this:
c = df['Country'].str.split(';', expand=True)[0].to_numpy()
df.groupby(c)['Employees'].sum()
Upvotes: 1