Reputation: 5385
I have a pandas data frame like this:
Column1 Column2 Column3 Column4 Column5
0 a 1 2 3 4
1 a 3 4 5
2 b 6 7 8
3 c 7 7
What I want to do now is getting a new dataframe containing Column1 and a new columnA. This columnA should contain all values from columns 2 -(to) n (where n is the number of columns from Column2 to the end of the row) like this:
Column1 ColumnA
0 a 1,2,3,4
1 a 3,4,5
2 b 6,7,8
3 c 7,7
How could I best approach this issue?
Upvotes: 99
Views: 237377
Reputation: 263
Do NOT use apply, it does not scale well. Instead use df.agg(). Using apply() will take seconds, but agg() will take milliseconds (ms).
import numpy as np
import pandas as pd
def createList(r1, r2):
return np.arange(r1, r2+1, 1)
sample_data = createList(1, 100_000) # a list of 100,000 values
test_df = pd.DataFrame(
[sample_data]
)
test_df.apply(lambda x: ','.join(x.astype(str))) #3.47 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
test_df.astype(str).agg(', '.join, axis=1) #34.8 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
As you can see from this sample, apply() took an average time of 3.47 seconds whereas agg() took an average time of 34.8 milliseconds. The gap in performance will become bigger as more data is added too.
*Note, I used %%timeit in jupyter notebook to get the run time for each method.
Upvotes: 10
Reputation: 241
I propose to use .assign
df2 = df.assign(ColumnA = df.Column2.astype(str) + ', ' + \
df.Column3.astype(str) + ', ' df.Column4.astype(str) + ', ' \
df.Column4.astype(str) + ', ' df.Column5.astype(str))
it's simple, maybe long but it worked for me
Upvotes: 24
Reputation: 394051
You can call apply
pass axis=1
to apply
row-wise, then convert the dtype to str
and join
:
In [153]:
df['ColumnA'] = df[df.columns[1:]].apply(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
df
Out[153]:
Column1 Column2 Column3 Column4 Column5 ColumnA
0 a 1 2 3 4 1,2,3,4
1 a 3 4 5 NaN 3,4,5
2 b 6 7 8 NaN 6,7,8
3 c 7 7 NaN NaN 7,7
Here I call dropna
to get rid of the NaN
, however we need to cast again to int
so we don't end up with floats as str.
Upvotes: 160
Reputation: 2881
If you have lot of columns say - 1000 columns in dataframe and you want to merge few columns based on particular column name
e.g. -Column2
in question and arbitrary no. of columns after that column (e.g. here 3 columns after 'Column2
inclusive of Column2
as OP asked).
We can get position of column using .get_loc()
- as answered here
source_col_loc = df.columns.get_loc('Column2') # column position starts from 0
df['ColumnA'] = df.iloc[:,source_col_loc+1:source_col_loc+4].apply(
lambda x: ",".join(x.astype(str)), axis=1)
df
Column1 Column2 Column3 Column4 Column5 ColumnA
0 a 1 2 3 4 1,2,3,4
1 a 3 4 5 NaN 3,4,5
2 b 6 7 8 NaN 6,7,8
3 c 7 7 NaN NaN 7,7
To remove NaN
, use .dropna()
or .fillna()
Hope it helps!
Upvotes: 12