Reputation: 2672
I have a dataframe like this:
Cause_of_death      famous_for        name  nationality
suicide by hanging  African jazz      XYZ   South
unknown             Korean president  ABC   South
heart attack        businessman       EFG   American
heart failure       Prime Minister    LMN   Indian
heart problems      African writer    PQR   South
The dataframe is very large. What I want to do is change the nationality column. You can see that where nationality = South, the famous_for column contains Korea or Africa as part of its strings. So I want to change nationality to South Africa if famous_for contains Africa, and to South Korea if famous_for contains Korea.
What I have tried is:
for i in deaths['nationality']:
    if (deaths['nationality']=='South'):
        if deaths['famous_for'].contains('Korea'):
            deaths['nationality']='South Korea'
        elif deaths['famous_for'].contains('Korea'):
            deaths['nationality']='South Africa'
        else:
            pass
Upvotes: 0
Views: 269
Reputation: 19947
You can use str.contains() to check whether the famous_for column includes Korea or Africa and set nationality accordingly.
# Note: these assignments apply to every matching row, not only rows where nationality is 'South'
df.loc[df.famous_for.str.contains('Korean'), 'nationality'] = 'South Korea'
df.loc[df.famous_for.str.contains('Africa'), 'nationality'] = 'South Africa'
df
Out[783]:
   Cause_of_death      famous_for        name  nationality
0  suicide by hanging  African jazz      XYZ   South Africa
1  unknown             Korean president  ABC   South Korea
2  heart attack        businessman       EFG   American
3  heart failure       Prime Minister    LMN   Indian
4  heart problems      African writer    PQR   South Africa
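Or you can do this in a single statement by extracting the matched country and appending it to nationality:
df.nationality = (
    df.nationality.str.cat(
        df.famous_for.str.extract('(Africa|Korea)', expand=False),
        sep=' ', na_rep='')
    # str.cat still adds the separator for unmatched rows, so strip the trailing space
    .str.strip())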
df
Out[801]:
   Cause_of_death      famous_for        name  nationality
0  suicide by hanging  African jazz      XYZ   South Africa
1  unknown             Korean president  ABC   South Korea
2  heart attack        businessman       EFG   American
3  heart failure       Prime Minister    LMN   Indian
4  heart problems      African writer    PQR   South Africa
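To see what each step of the str.cat approach produces, here is a minimal sketch on a three-row sample. The standalone `famous_for`, `nationality`, `country` and `joined` names are made up for illustration and are not part of the original answer.

```python
import pandas as pd

famous_for = pd.Series(['African jazz', 'Korean president', 'businessman'])
nationality = pd.Series(['South', 'South', 'American'])

# expand=False returns a Series; rows with no match become NaN.
country = famous_for.str.extract('(Africa|Korea)', expand=False)

# na_rep='' substitutes an empty string for NaN, but the separator is
# still inserted, so unmatched rows end with a trailing space.
joined = nationality.str.cat(country, sep=' ', na_rep='')

print(joined.tolist())              # ['South Africa', 'South Korea', 'American ']
print(joined.str.strip().tolist())  # ['South Africa', 'South Korea', 'American']
```

Note that this approach appends the extracted country to any nationality, so it is only safe when 'Africa' or 'Korea' never appears in famous_for for rows that are not 'South'.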
Upvotes: 2
Reputation: 862481
If there are many conditions, use a custom function with DataFrame.apply and axis=1 to process the data row by row:
def f(x):
    # Only rows whose nationality is 'South' are candidates for a change
    if x['nationality'] == 'South':
        if 'Korea' in x['famous_for']:
            return 'South Korea'
        elif 'Africa' in x['famous_for']:
            return 'South Africa'
    # Every other row keeps its current nationality
    return x['nationality']

deaths['nationality'] = deaths.apply(f, axis=1)
print(deaths)
   Cause_of_death      famous_for        name  nationality
0  suicide by hanging  African jazz      XYZ   South Africa
1  unknown             Korean president  ABC   South Korea
2  heart attack        businessman       EFG   American
3  heart failure       Prime Minister    LMN   Indian
4  heart problems      African writer    PQR   South Africa
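As a vectorized alternative when the list of conditions grows, the same logic could probably be sketched with numpy.select. This is a swapped-in technique, not the apply approach shown above; the sample frame and the `is_south`, `conditions` and `choices` names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small sample frame rebuilt from the question (illustrative data only).
deaths = pd.DataFrame({
    'famous_for': ['African jazz', 'Korean president', 'businessman',
                   'Prime Minister', 'African writer'],
    'nationality': ['South', 'South', 'American', 'Indian', 'South'],
})

is_south = deaths['nationality'] == 'South'

# Conditions are checked in order; the first one that is True wins.
conditions = [
    is_south & deaths['famous_for'].str.contains('Korea'),
    is_south & deaths['famous_for'].str.contains('Africa'),
]
choices = ['South Korea', 'South Africa']

# Rows matching no condition keep their current nationality.
deaths['nationality'] = np.select(conditions, choices,
                                  default=deaths['nationality'])
print(deaths['nationality'].tolist())
```

Adding another country then only means appending one entry to each of the two lists.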
But if there are only a few conditions, use str.contains with DataFrame.loc:
mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')
deaths.loc[mask1 & mask2, 'nationality']='South Korea'
deaths.loc[mask1 & mask3, 'nationality']='South Africa'
print (deaths)
   Cause_of_death      famous_for        name  nationality
0  suicide by hanging  African jazz      XYZ   South Africa
1  unknown             Korean president  ABC   South Korea
2  heart attack        businessman       EFG   American
3  heart failure       Prime Minister    LMN   Indian
4  heart problems      African writer    PQR   South Africa
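One caveat worth sketching: if famous_for has missing values, str.contains may return NaN for those rows (depending on the pandas version and column dtype), which can raise an error or behave unexpectedly when the result is used as a boolean mask. Passing na=False treats missing text as no match. The sample frame below, including its None entry, is made up for illustration.

```python
import pandas as pd

deaths = pd.DataFrame({
    'famous_for': ['African jazz', 'Korean president', None],
    'nationality': ['South', 'South', 'South'],
})

mask1 = deaths['nationality'] == 'South'
# na=False makes rows with a missing famous_for count as "no match".
mask2 = deaths['famous_for'].str.contains('Korea', na=False)
mask3 = deaths['famous_for'].str.contains('Africa', na=False)

deaths.loc[mask1 & mask2, 'nationality'] = 'South Korea'
deaths.loc[mask1 & mask3, 'nationality'] = 'South Africa'
print(deaths['nationality'].tolist())
```

The row with no famous_for text keeps its original nationality of 'South'.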
Another solution uses Series.mask:
mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')
deaths['nationality'] = deaths['nationality'].mask(mask1 & mask2, 'South Korea')
deaths['nationality'] = deaths['nationality'].mask(mask1 & mask3,'South Africa')
print (deaths)
   Cause_of_death      famous_for        name  nationality
0  suicide by hanging  African jazz      XYZ   South Africa
1  unknown             Korean president  ABC   South Korea
2  heart attack        businessman       EFG   American
3  heart failure       Prime Minister    LMN   Indian
4  heart problems      African writer    PQR   South Africa
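For readers unsure which way round Series.mask works, here is a small sketch comparing it with Series.where. The `s`, `cond`, `masked` and `kept` names are hypothetical and only serve the example.

```python
import pandas as pd

s = pd.Series(['South', 'American', 'South'])
cond = pd.Series([True, False, False])

# mask: replace values where the condition is True.
masked = s.mask(cond, 'South Korea')

# where: keep values where the condition is True and replace the rest,
# so inverting the condition gives the same result as mask.
kept = s.where(~cond, 'South Korea')

print(masked.tolist())  # ['South Korea', 'American', 'South']
print(kept.tolist())    # ['South Korea', 'American', 'South']
```

Both calls return a new Series, which is why the result is assigned back to deaths['nationality'] in the answer above.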
Upvotes: 1