bogdanCsn

Reputation: 1325

How to test for list equality in a column where cells are lists

I want to test whether cells that contain lists are equal to [0] while Id==4, and set a new column to 1 when both conditions hold. Input and expected output are below.
I tried several approaches but only got it working with apply and a lambda, which does not scale well to 50k+ rows. Is there a faster way I'm missing?
Input:

import numpy as np
import pandas as pd


df = pd.DataFrame({'Id': [1,2,3,4],
                   'Var1': [[0,1],[0],[6,7],[0]],
                  })

Id    Var1
1  [0, 1]
2     [0]
3  [6, 7]
4     [0]

What I've tried:

df['ERR'] = 0
df.loc[(df['Id']==4) & (df['Var1']==[0]) , 'ERR'] = 1     # doesn't work
df.loc[(df['Id']==4) & (df['Var1'].isin([0])) , 'ERR'] = 1 # doesn't work
df['ERR'] = df.apply(lambda x: 1 if x['Id']==4 and x['Var1']==[0]   else 0 , axis = 1)

Expected output:

Id    Var1  ERR
 1  [0, 1]    0
 2     [0]    0
 3  [6, 7]    0
 4     [0]    1

Upvotes: 0

Views: 86

Answers (1)

jezrael

Reputation: 862851

You can convert each list to a tuple or a set and compare that instead:

df['ERR1'] = ((df['Id']==4) & (df['Var1'].apply(tuple)==(0, ))).astype(int)
df['ERR2'] = ((df['Id']==4) & ([tuple(x) ==(0, )  for x in df['Var1']])).astype(int)

df['ERR3'] = ((df['Id']==4) & (df['Var1'].apply(set)==set([0]))).astype(int)
df['ERR4'] = ((df['Id']==4) & ([set(x) == set([0])  for x in df['Var1']])).astype(int)
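
Note that the tuple and set variants are not interchangeable in general: a tuple comparison keeps order and duplicates, while a set comparison ignores both. The following is a minimal sketch of that difference, not part of the original timings. The sample frame and the column names ERR_tuple and ERR_set are made up for illustration, and the masks are wrapped in pd.Series so the example does not rely on how pandas combines a Series with a plain list.

```python
import pandas as pd

# Hypothetical sample: every row has Id == 4, but only the first list is exactly [0].
df = pd.DataFrame({'Id': [4, 4, 4],
                   'Var1': [[0], [0, 0], [0, 1]]})

is_id4 = df['Id'] == 4

# Tuple comparison: exact match on length, order and duplicates.
tuple_mask = pd.Series([tuple(x) == (0,) for x in df['Var1']], index=df.index)

# Set comparison: duplicates and order are ignored, so [0, 0] also matches.
set_mask = pd.Series([set(x) == {0} for x in df['Var1']], index=df.index)

df['ERR_tuple'] = (is_id4 & tuple_mask).astype(int)
df['ERR_set'] = (is_id4 & set_mask).astype(int)

print(df)
```

On this sample, ERR_tuple flags only the first row, while ERR_set also flags the [0, 0] row. If the lists must be exactly [0], the tuple variants are the safer choice.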

Performance (depends on the input data):

df = pd.DataFrame({'Id': [1,2,3,4],
                   'Var1': [[0,1],[0],[6,7],[0]],
                  })
df = pd.concat([df] * 10000, ignore_index=True)


In [188]: %timeit df['ERR1'] = ((df['Id']==4) & (df['Var1'].apply(tuple)==(0, ))).astype(int)
13.1 ms ± 318 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [189]: %timeit df['ERR2'] = ((df['Id']==4) & ([tuple(x) ==(0, )  for x in df['Var1']])).astype(int)
8.98 ms ± 266 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [190]: %timeit df['ERR3'] = ((df['Id']==4) & (df['Var1'].apply(set)==set([0]))).astype(int)
17 ms ± 451 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [191]: %timeit df['ERR4'] = ((df['Id']==4) & ([set(x) == set([0])  for x in df['Var1']])).astype(int)
19.4 ms ± 93.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 2
