Hagai Israeli

Reputation: 23

Conditionally assign values to DF column using np.where

I'm trying to group some columns into a list inside a single column. If any of these columns contains NaN, the result column should be just NaN instead of the list.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a.0': [11, 1.1], 'a.5': [12, 1.2], 'a.10': [13, np.nan]})

The result column of the DF should be as such:

    a.0   a.10   a.5   result
0  11.0   13.0  12.0   [11, 13, 12]
1   1.1    NaN   1.2   NaN

These 2 lines do the job:

df['result'] = df[['a.0','a.10','a.5']].values.tolist()
df['result'] = np.where(df[['a.0','a.10','a.5']].isnull().any(axis=1), np.nan, df['result'])

I was wondering how to do this in one line. Help would be appreciated.

Upvotes: 2

Views: 1278

Answers (3)

BENY

Reputation: 323226

df['result'] = df.apply(lambda x: pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan, axis=1)
df
Out[30]: 
    a.0  a.10   a.5              result
0  11.0  13.0  12.0  [11.0, 13.0, 12.0]
1   1.1   NaN   1.2                 NaN

Upvotes: 2

cs95

Reputation: 402413

Using isnull + any + mask/where

df['result'] = pd.Series(df.values.tolist()).mask(df.isnull().any(1))

Or,

df['result'] = pd.Series(df.values.tolist()).where(~df.isnull().any(1))

    a.0  a.10   a.5              result
0  11.0  13.0  12.0  [11.0, 13.0, 12.0]
1   1.1   NaN   1.2                 NaN
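
On newer pandas versions, the positional `any(1)` is no longer accepted, so a hedged equivalent with an explicit `axis=1` might look like the sketch below. Note that `index=df.index` matters whenever the frame's index is not the default RangeIndex, since `mask` and assignment both align on the index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a.0': [11, 1.1], 'a.5': [12, 1.2], 'a.10': [13, np.nan]})
cols = ['a.0', 'a.10', 'a.5']

# build one list per row, then blank out rows that contain any NaN;
# index=df.index keeps the new Series aligned with df on assignment
df['result'] = pd.Series(df[cols].values.tolist(), index=df.index).mask(
    df[cols].isnull().any(axis=1))
```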

Performance

df = pd.concat([df] * 100000, ignore_index=True)

# Wen's solution
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
1 loop, best of 3: 1min 37s per loop

# Anton vBR's solution
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
1 loop, best of 3: 5.79 s per loop

# my answer
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
10 loops, best of 3: 133 ms per loop

Conclusions

  • apply is inefficient, often slower even than a pure Python loop, because each apply call carries significant per-row overhead.
  • A vectorised approach such as where/mask operates on whole arrays at once and offers much better performance than loopy solutions.

Upvotes: 2

Anton vBR

Reputation: 18906

Update: with timings on large data-sets, cᴏʟᴅsᴘᴇᴇᴅ's answer is the best; list comprehensions suffer there. I've updated my previous answer with timings.

You could use itertuples and assign np.nan if any np.nan in row:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a.0':np.random.choice(np.append(np.nan,np.arange(10)), 1000), 
                   'a.5':np.random.choice(10, 1000), 
                   'a.10':np.random.choice(10, 1000)})

# 3 solutions to solve the problem 
# Assign with df['results'] =
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)

Timings:

100 loops, best of 3: 8.38 ms per loop
1000 loops, best of 3: 772 µs per loop
1 loop, best of 3: 214 ms per loop
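
Assembled into a standalone assignment (a sketch, using the column order from the question), the itertuples version might look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a.0': [11, 1.1], 'a.5': [12, 1.2], 'a.10': [13, np.nan]})
cols = ['a.0', 'a.10', 'a.5']

# v[0] is the row index, v[1:] the cell values in column order;
# np.isnan(v) checks the whole tuple, so this assumes a numeric index
df['result'] = [np.nan if np.isnan(v).any() else list(v[1:])
                for v in df[cols].itertuples()]
```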

Upvotes: 1
