Reputation: 28222
In pandas, how can one column be derived from multiple other columns?
For example, let's say I want to annotate my dataset with the correct form of address for each subject, perhaps to label some plots so I can tell who the results are for.
Take a dataset:
import pandas as pd
data = [('male', 'Homer', 'Simpson'), ('female', 'Marge', 'Simpson'), ('male', 'Bart', 'Simpson'), ('female', 'Lisa', 'Simpson'), ('infant', 'Maggie', 'Simpson')]
people = pd.DataFrame(data, columns=["gender", "first_name", "last_name"])
So we have:
   gender first_name last_name
0    male      Homer   Simpson
1  female      Marge   Simpson
2    male       Bart   Simpson
3  female       Lisa   Simpson
4  infant     Maggie   Simpson
And a function that I want to apply to each row, storing the result in a new column:
def get_address(gender, first, last):
    title = ""
    if gender == 'male':
        title = 'Mr'
    elif gender == 'female':
        title = 'Ms'
    if title == '':
        return first + ' ' + last
    else:
        return title + ' ' + first[0] + '. ' + last
Currently my method is:
people['address'] = list(map(lambda row: get_address(*row), people.values))
   gender first_name last_name         address
0    male      Homer   Simpson   Mr H. Simpson
1  female      Marge   Simpson   Ms M. Simpson
2    male       Bart   Simpson   Mr B. Simpson
3  female       Lisa   Simpson   Ms L. Simpson
4  infant     Maggie   Simpson  Maggie Simpson
This works, but it is not elegant. It also feels wrong to convert to an unindexed list and then assign back into an indexed column.
Upvotes: 0
Views: 948
Reputation: 4051
What you are looking for is apply(func, axis=1).
This applies a function row-wise across your DataFrame.
In your example, modify get_address to take a single row:
def get_address(row):  # row is a pandas Series with the column names as its index
    title = ""
    gender = row['gender']     # extract gender from the row
    first = row['first_name']  # extract first name from the row
    last = row['last_name']    # extract last name from the row
    if gender == 'male':
        title = 'Mr'
    elif gender == 'female':
        title = 'Ms'
    if title == '':
        return first + ' ' + last
    else:
        return title + ' ' + first[0] + '. ' + last
Then call people.apply(get_address, axis=1).
This returns a pandas Series that shares the DataFrame's index, which is how pandas knows how to align it correctly when it is added as a column. To add it to your DataFrame, use:
people['address'] = people.apply(get_address,axis=1)
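If you would rather leave the original three-argument get_address from the question untouched, a small lambda can unpack each row instead. This is only a sketch, assuming the sample data and column names from the question:

```python
import pandas as pd

data = [('male', 'Homer', 'Simpson'), ('female', 'Marge', 'Simpson'),
        ('male', 'Bart', 'Simpson'), ('female', 'Lisa', 'Simpson'),
        ('infant', 'Maggie', 'Simpson')]
people = pd.DataFrame(data, columns=["gender", "first_name", "last_name"])


def get_address(gender, first, last):
    # Same logic as the function in the question.
    title = ""
    if gender == 'male':
        title = 'Mr'
    elif gender == 'female':
        title = 'Ms'
    if title == '':
        return first + ' ' + last
    return title + ' ' + first[0] + '. ' + last


# axis=1 passes each row as a Series; the lambda picks out the columns by name.
people['address'] = people.apply(
    lambda row: get_address(row['gender'], row['first_name'], row['last_name']),
    axis=1,
)
print(people)
```

Selecting the columns by name keeps this working even if the DataFrame later gains extra columns, which get_address(*row) would not tolerate.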
Upvotes: 2
Reputation: 25672
You can do this without any explicit looping:
In [70]: df
Out[70]:
   gender first_name last_name
0    male      Homer   Simpson
1  female      Marge   Simpson
2    male       Bart   Simpson
3  female       Lisa   Simpson
4  infant     Maggie   Simpson
In [71]: title = df.gender.replace({'male': 'Mr', 'female': 'Ms', 'infant': ''})
In [72]: initial = np.where(df.gender != 'infant', df.first_name.str[0] + '. ', df.first_name + ' ')
In [73]: initial
Out[73]: array(['H. ', 'M. ', 'B. ', 'L. ', 'Maggie '], dtype=object)
In [74]: address = (title + ' ' + Series(initial) + df.last_name).str.strip()
In [75]: address
Out[75]:
0     Mr H. Simpson
1     Ms M. Simpson
2     Mr B. Simpson
3     Ms L. Simpson
4    Maggie Simpson
dtype: object
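For anyone who wants to run this as a script rather than in an IPython session, here is a self-contained variation with the imports spelled out. Note that it is not identical to the session above: it uses Series.map with fillna instead of replace, so any gender without a listed title falls back to the full name (as the question's function does), and it passes index=df.index so the np.where result aligns even when the index is not 0..n-1. Treat it as a sketch under those assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [('male', 'Homer', 'Simpson'), ('female', 'Marge', 'Simpson'),
     ('male', 'Bart', 'Simpson'), ('female', 'Lisa', 'Simpson'),
     ('infant', 'Maggie', 'Simpson')],
    columns=['gender', 'first_name', 'last_name'],
)

# Map each gender to a title; genders not in the dict become an empty string.
title = df['gender'].map({'male': 'Mr', 'female': 'Ms'}).fillna('')

# Rows with a title use an initial; rows without keep the full first name.
has_title = title != ''
first = pd.Series(
    np.where(has_title, df['first_name'].str[0] + '.', df['first_name']),
    index=df.index,  # np.where returns a plain array, so restore the index
)

# strip() removes the leading space left behind by an empty title.
df['address'] = (title + ' ' + first + ' ' + df['last_name']).str.strip()
print(df['address'])
```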
Check out the documentation for the Series.str methods; they're pretty rad. Most methods from str are implemented, in addition to goodies like extract.
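To give a feel for the Series.str accessor, here is a brief sketch of a few of its methods. The `names` Series and the regular expression are made up for illustration and are not part of the answer above:

```python
import pandas as pd

# Hypothetical sample data, with stray whitespace on one entry.
names = pd.Series(['Homer Simpson', ' Marge Simpson ', 'Bart Simpson'])

# Element-wise equivalents of ordinary str methods and indexing.
cleaned = names.str.strip()
initials = cleaned.str[0]
shouted = cleaned.str.upper()

# extract pulls regex groups into columns; named groups become column names.
parts = cleaned.str.extract(r'^(?P<first>\w+)\s+(?P<last>\w+)$')
print(parts)
```

Each call returns a new Series (or a DataFrame, in the case of extract with several groups) indexed like the original, so the results can be assigned straight back as columns.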
Upvotes: 1