Ekaterina
Ekaterina

Reputation: 195

Create new column and fill with added strings from same df with underscore as delimiter

I have a following DataFrame:

import pandas as pd
ds = pd.DataFrame({'place': [1,2,3], 'date': ['6/7/2021','1/1/2031','1/4/2011'], 
'city':['Moscow','New York','Sidney'], 'kids':[5,3,1]})

that looks like this:

city        date      kids   place
Moscow    6/7/2021     5       1
New York  1/1/2031     3       2
Sidney    1/4/2011     1       3

I need to add a new column 'key' to a DataFrame, values in which would be a string that would be a union of values in each row of certain columns: 'city', 'date', 'place' with an underscore ('_') as a delimiter.

Eventual goal:

city        date      kids   place   key
Moscow    6/7/2021     5       1     Moscow_6/7/2021_1
New York  1/1/2031     3       2     New York_1/1/2031_2
Sidney    1/4/2011     1       3     Sidney_1/4/2011_3

First of all, I've converted integers in a 'place' column into strings:

df['place'].apply(str)

Secondly, I've removed columns I don't need (but actually this column should be in a resulting df):

col_list= list(df)
col_list.remove('kids')

Then I've tried to create a new column and to fill it with appended strings:

df['key'] = df[col_list].apply(lambda x: x.sum())

But it returns me NaN values in a 'key' column and actually values still appended not with an undersoce as a delimiter.

Upvotes: 2

Views: 2056

Answers (4)

EdChum
EdChum

Reputation: 394209

You can just add the columns similar to building a str, you just need to cast the last column to str:

In[87]:
df['key'] = df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)
df

Out[87]: 
       city      date  kids  place                  key
0    Moscow  6/7/2021     5      1    Moscow_6/7/2021_1
1  New York  1/1/2031     3      2  New York_1/1/2031_2
2    Sidney  1/4/2011     1      3    Sidney_1/4/2011_3

Timings With a 30k row df:

%timeit  df['city'] + '_' + df['date'] + '_' + df['place'].astype(str)
df
%timeit df[['city', 'date', 'place']].astype(str).apply('_'.join, 1)
%timeit df[['city', 'date', 'place']].astype(str).transform('_'.join, 1)

100 loops, best of 3: 9 ms per loop
10 loops, best of 3: 84 ms per loop
10 loops, best of 3: 83 ms per loop

This method seems to be 10x faster, probably as it's just a straight vectorised addition, apply is a cythonised for loop, it looks like transform needs to do some form of grouping, sorting, and then creating a new object with index aligned to original df which is where the time is being taken is my guess

Other @JohnGalt method:

%timeit pd.Series(df[['city', 'date', 'place']].astype(str).values.tolist()).str.join('_')

100 loops, best of 3: 10.5 ms per loop

So this is marginally slower than my answer.

Upvotes: 4

omu_negru
omu_negru

Reputation: 4770

apply could also do the job:

df[key] = df[columns].apply(lambda x: '_'.join([str(y) for y in x]))

Upvotes: 0

cs95
cs95

Reputation: 402872

Try df.assign with df.transform (v0.20+):

In [203]: df.assign(key=df[['city', 'date', 'place']].astype(str).transform('_'.join, 1))
Out[203]: 
       city      date  kids  place                  key
0    Moscow  6/7/2021     5      1    Moscow_6/7/2021_1
1  New York  1/1/2031     3      2  New York_1/1/2031_2
2    Sidney  1/4/2011     1      3    Sidney_1/4/2011_3

Note that df.assign is not inplace, to persist the changes, you'd do: df = df.assign(.....).

Upvotes: 2

Zero
Zero

Reputation: 76987

You could make it more generic with apply and join

In [485]: ds[['city', 'date', 'place']].astype(str).apply('_'.join, 1)
Out[485]:
0      Moscow_6/7/2021_1
1    New York_1/1/2031_2
2      Sidney_1/4/2011_3
dtype: object

In [486]: ds['key'] = ds[['city', 'date', 'place']].astype(str).apply('_'.join, 1)

In [487]: ds
Out[487]:
       city      date  kids  place                  key
0    Moscow  6/7/2021     5      1    Moscow_6/7/2021_1
1  New York  1/1/2031     3      2  New York_1/1/2031_2
2    Sidney  1/4/2011     1      3    Sidney_1/4/2011_3

Or, to use .str methods, you could

In [493]: pd.Series(ds[['city', 'date', 'place']].astype(str).values.tolist()).str.join('_')
Out[493]:
0      Moscow_6/7/2021_1
1    New York_1/1/2031_2
2      Sidney_1/4/2011_3
dtype: object

Upvotes: 4

Related Questions