Camilla
Camilla

Reputation: 131

find if the value in pandas column is in another column's list

I have a df that looks like this:

    ID       dates                              holidays
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 27/01/2022]

I need to create another column isHoliday with bool value if the value in column dates is in holidays list for that id. I am trying to use the code below but it shows an error: TypeError: unhashable type: 'list':

df['isHoliday'] = np.where(df['dates'].isin(df['holidays']), True, False)

Is it any way to find whether the columns dates value is in the list that corresponds each ID?

Upvotes: 0

Views: 3386

Answers (4)

user17242583
user17242583

Reputation:

explode works really well here:

df['isHoliday'] = df.eval('dates in holidays.explode()')

Output:

>>> df
    ID       dates                              holidays  isHoliday
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 27/01/2022]      False

Benchmarks:

def scott():
    [d in l for d, l in zip(df['dates'], df['holidays'])]

def shubham():
    pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)

def charles():
    df[['dates','holidays']].apply(lambda row: row['dates'] in row['holidays'], axis=1)

def richardec():
    df.eval('dates in holidays.explode()')

###

%timeit scott()
# 5.99 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit shubham()
# 547 µs ± 10.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit charles()
# 525 µs ± 2.27 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit richardec()
# 971 µs ± 71.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So mine is the shortest, but Scott's is the fastest!

Upvotes: 5

Shubham Sharma
Shubham Sharma

Reputation: 71707

Here is a vectorized approach:

df['isholiday'] = pd.DataFrame([*df['holidays']]).eq(df['dates'], axis=0).any(1)

Result

    ID       dates                                holidays  isholiday
0  ABC  01/01/2022    [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022,  06/01/2022,  27/01/2022]      False

Upvotes: 3

Scott Boston
Scott Boston

Reputation: 153540

Try this:

df['isholiday'] = [d in l for d, l in zip(df['dates'], df['holidays'])]

Output:

    ID       dates                              holidays  isholiday
0  ABC  01/01/2022  [12/25/2021, 01/01/2022, 01/17/2022]       True
1  DEF  03/17/2022  [01/01/2022, 06/01/2022, 01/27/2022]      False

Upvotes: 4

Charles Bushrow
Charles Bushrow

Reputation: 477

The source of the issue is that the .isin() method does not vectorize its argument, so it's checking if each date is in the whole series of lists, and in the process I believe it casts the argument to a set for speed purposes. Sets only accept immutable values, and lists are mutable, so it gives the unhashable type error.

I think you'll want to use an apply instead:

df['isHoliday'] = df[['dates','holidays']].apply(
    lambda row: row['dates'] in row['holidays'], 
    axis=1
)

In this case you don't need the np.where because the condition itself returns a series of Trues and Falses.

Upvotes: 3

Related Questions