Prayson W. Daniel
Prayson W. Daniel

Reputation: 15568

Concatenate Pandas column name to column value

Is there any efficient way to concatenate Pandas column name to its value. I will like to prefix all my DataFrame values with their column names.

My current method is very slow on a large dataset:


import pandas as pd

# test data 

df = pd.read_csv(pd.compat.StringIO('''date value data
01/01/2019 30 data1
01/01/2019 40 data2
02/01/2019 20 data1
02/01/2019 10 data2'''), sep=' ')

# slow method 

dt = [df[c].apply(lambda x:f'{c}_{x}').values for c in df.columns]

dt = pd.DataFrame(dt, index=df.columns).T

The problem is that list compression and copying of data slows the transformation on a large dataset with lots of columns.

Is there are better way to prefix columns name to values?

Upvotes: 2

Views: 4353

Answers (3)

Doron Bergman
Doron Bergman

Reputation: 1

This solution:

result = pd.DataFrame({col: col + "_" + m[col].astype(str) for col in m.columns})

is as performant as the fastest solution above, and might be more readable, at least to some.

Upvotes: 0

piRSquared
piRSquared

Reputation: 294278

numpy.core.defchararray.add

from numpy.core.defchararray import add

a = df.to_numpy().astype(str)
b = df.columns.to_numpy().astype(str)

dt = pd.DataFrame(add(add(b, '_'), a), df.index, df.columns)
dt

              date     value        data
0  date_01/01/2019  value_30  data_data1
1  date_01/01/2019  value_40  data_data2
2  date_02/01/2019  value_20  data_data1
3  date_02/01/2019  value_10  data_data2

This isn't as fast as the fastest answer but it's pretty zippy (see what I did there)

a = df.columns.tolist()
pd.DataFrame(
    [[f'{k}_{v}' for k, v in zip(a, t)]
     for t in zip(*map(df.get, a))],
    df.index, df.columns
)

Upvotes: 2

anky
anky

Reputation: 75080

here is a way without loops:

pd.DataFrame([df.columns]*len(df),columns=df.columns)+"_"+df.astype(str)

              date     value        data
0  date_01/01/2019  value_30  data_data1
1  date_01/01/2019  value_40  data_data2
2  date_02/01/2019  value_20  data_data1
3  date_02/01/2019  value_10  data_data2

Timings (fastest to slowest):

m = pd.concat([df]*20000,ignore_index=True)

%%timeit
m.astype(str).radd(m.columns + '_')
#410 ms ± 39.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
m.astype(str).radd('_').radd([*m]) # courtesy @piR
#470 ms ± 76.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit      #piR solution
a = m.to_numpy().astype(str)
b = m.columns.to_numpy().astype(str)
pd.DataFrame(add(add(b, '_'), a), m.index, m.columns)
#710 ms ± 74.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit #anky_91 sol
pd.DataFrame([m.columns]*len(m),columns=m.columns)+"_"+m.astype(str)
#1.7 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit #OP sol
dt = [m[c].apply(lambda x:f' {c}_{x}').values for c in m.columns]
pd.DataFrame(dt, index=m.columns).T
#14.4 s ± 643 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 3

Related Questions