Reputation: 195
I have a following DataFrame:
import pandas as pd
ds = pd.DataFrame({'place': [1,2,3], 'date': ['6/7/2021','1/1/2031','1/4/2011'],
'city':['Moscow','New York','Sidney'], 'kids':[5,3,1]})
that looks like this:
city date kids place
Moscow 6/7/2021 5 1
New York 1/1/2031 3 2
Sidney 1/4/2011 1 3
I need to add a new column 'key' to a DataFrame, values in which would be a string that would be a union of values in each row of certain columns: 'city', 'date', 'place' with an underscore ('_') as a delimiter.
Eventual goal:
city date kids place key
Moscow 6/7/2021 5 1 Moscow_6/7/2021_1
New York 1/1/2031 3 2 New York_1/1/2031_2
Sidney 1/4/2011 1 3 Sidney_1/4/2011_3
First of all, I've converted integers in a 'place' column into strings:
df['place'].apply(str)
Secondly, I've removed columns I don't need (but actually this column should be in a resulting df):
col_list= list(df)
col_list.remove('kids')
Then I've tried to create a new column and to fill it with appended strings:
df['key'] = df[col_list].apply(lambda x: x.sum())
But it returns me NaN values in a 'key' column and actually values still appended not with an undersoce as a delimiter.
Upvotes: 2
Views: 2056
Reputation: 394209
You can just add the columns similar to building a str
, you just need to cast the last column to str
:
In[87]:
df['key'] = df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)
df
Out[87]:
city date kids place key
0 Moscow 6/7/2021 5 1 Moscow_6/7/2021_1
1 New York 1/1/2031 3 2 New York_1/1/2031_2
2 Sidney 1/4/2011 1 3 Sidney_1/4/2011_3
Timings With a 30k row df:
%timeit df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)
df
%timeit df[['city', 'date', 'place']].astype(str).apply('_'.join, 1)
%timeit df[['city', 'date', 'place']].astype(str).transform('_'.join, 1)
100 loops, best of 3: 9 ms per loop
10 loops, best of 3: 84 ms per loop
10 loops, best of 3: 83 ms per loop
This method seems to be 10x faster, probably as it's just a straight vectorised addition, apply
is a cythonised for
loop, it looks like transform
needs to do some form of grouping
, sorting, and then creating a new object with index aligned to original df which is where the time is being taken is my guess
Other @JohnGalt method:
%timeit pd.Series(df[['city', 'date', 'place']].astype(str).values.tolist()).str.join('_')
100 loops, best of 3: 10.5 ms per loop
So this is marginally slower than my answer.
Upvotes: 4
Reputation: 4770
apply could also do the job:
df[key] = df[columns].apply(lambda x: '_'.join([str(y) for y in x]))
Upvotes: 0
Reputation: 402872
Try df.assign
with df.transform
(v0.20+):
In [203]: df.assign(key=df[['city', 'date', 'place']].astype(str).transform('_'.join, 1))
Out[203]:
city date kids place key
0 Moscow 6/7/2021 5 1 Moscow_6/7/2021_1
1 New York 1/1/2031 3 2 New York_1/1/2031_2
2 Sidney 1/4/2011 1 3 Sidney_1/4/2011_3
Note that df.assign
is not inplace, to persist the changes, you'd do: df = df.assign(.....)
.
Upvotes: 2
Reputation: 76987
You could make it more generic with apply
and join
In [485]: ds[['city', 'date', 'place']].astype(str).apply('_'.join, 1)
Out[485]:
0 Moscow_6/7/2021_1
1 New York_1/1/2031_2
2 Sidney_1/4/2011_3
dtype: object
In [486]: ds['key'] = ds[['city', 'date', 'place']].astype(str).apply('_'.join, 1)
In [487]: ds
Out[487]:
city date kids place key
0 Moscow 6/7/2021 5 1 Moscow_6/7/2021_1
1 New York 1/1/2031 3 2 New York_1/1/2031_2
2 Sidney 1/4/2011 1 3 Sidney_1/4/2011_3
Or, to use .str
methods, you could
In [493]: pd.Series(ds[['city', 'date', 'place']].astype(str).values.tolist()).str.join('_')
Out[493]:
0 Moscow_6/7/2021_1
1 New York_1/1/2031_2
2 Sidney_1/4/2011_3
dtype: object
Upvotes: 4