Reputation: 23
I'm trying to group some columns into a list inside a single column. In case one of these columns contain NaN, the result column should be just NaN instead of the list.
df = pd.DataFrame({'a.0':[11, 1.1], 'a.5':[12, 1.2], 'a.10':[13, pd.np.NaN]})
The result column of the DF should be as such:
a.0 a.10 a.5 result
0 . 11.0 13.0 12.0 [11, 13, 12]
1 . 1.1 nan 1.2 nan
These 2 lines do the job:
df['result'] = df[['a.0','a.10','a.5']].values.tolist()
df['result'] = pd.np.where(df[['a.0','a.10','a.5']].isnull().any(axis=1), pd.np.nan, df['result'])
And I was wondering how to do it in one line. Help would be appreciated
Upvotes: 2
Views: 1278
Reputation: 323226
df['result']=df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
df
Out[30]:
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
Upvotes: 2
Reputation: 402413
Using isnull
+ any
+ mask
/where
df['result'] = pd.Series(df.values.tolist()).mask(df.isnull().any(1))
Or,
df['result'] = pd.Series(df.values.tolist()).where(~df.isnull().any(1))
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# Wen's solution
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
1 loop, best of 3: 1min 37s per loop
# Anton vBR's solution
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
1 loop, best of 3: 5.79 s per loop
# my answer
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
10 loops, best of 3: 133 ms per loop
Conclusions
apply
is inefficient, more so than a loop. An apply
call typically has many overheads associated with it. A pure python loop is generally faster.where
/mask
will vectorize operations and offers improved performance over loopy solutions.Upvotes: 2
Reputation: 18906
Update With timings and large data-sets cᴏʟᴅsᴘᴇᴇᴅ answer is the best. List comprehensions always suffer here. I've updated my previous answer with timings.
You could use itertuples and assign np.nan if any np.nan in row:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a.0':np.random.choice(np.append(np.nan,np.arange(10)), 1000),
'a.5':np.random.choice(10, 1000),
'a.10':np.random.choice(10, 1000)})
# 3 solutions to solve the problem
# Assign with df['results'] =
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
Timings:
100 loops, best of 3: 8.38 ms per loop
1000 loops, best of 3: 772 µs per loop
1 loop, best of 3: 214 ms per loop
Upvotes: 1