Reputation: 471
I have a the following pandas table
df:
EVNT_ID col1 col2 col3 col4
123454 1 Nan 4 5
628392 Nan 3 Nan 7
293899 2 Nan Nan 6
127820 9 11 12 19
Now I am trying to concat all the columns except the first column and I want my data frame to look in the following way
new_df:
EVNT_ID col1 col2 col3 col4 new_col
123454 1 Nan 4 5 1|4|5
628392 Nan 3 Nan 7 3|7
293899 2 Nan Nan 6 2|6
127820 9 11 12 19 9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate if any one can give me where I am wrong. I'd really appreciate that.
Upvotes: 1
Views: 1279
Reputation: 2122
import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
'A' : [1, np.nan,4, 7],
'B' : [2, np.nan, 5, 8],
'C' : [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')
start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""
Upvotes: 0
Reputation: 30971
Try the following code:
df['new_col'] = df.iloc[:, 1:].apply(lambda x:
'|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I thought about x.dropna()
instead of x if str(el) != 'nan'
,
but %timeit
showed that dropna()
works much slower.
Upvotes: 2
Reputation: 402323
You can do this with filter
and agg
:
df.filter(like='col').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop('EVNT_ID', 1).agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here's timings for the two fastest solutions here.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
'|'.join([k for k in a if k])
for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although note the answers aren't identical because @RafaelC's code produces floats: ['1.0|2.0|9.0', '3.0|11.0', ...]
. If this is fine, then great. Otherwise you'll need to convert to int which adds more overhead.
Upvotes: 1
Reputation: 59274
Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1