Pro Q

Reputation: 5016

`isin` fails to detect a row that is in a dataframe

I've been struggling with an error for days, and after many conversations with ChatGPT, finally got it boiled down to this one minimal example:

import pandas as pd

# Create two data frames with duplicate values
goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
source_df = pd.DataFrame({'user_id': [1, 1], 'sentence_id': [2, 2]})

# The first assertion passes
assert (goal_df[['user_id', 'sentence_id']].iloc[0] == source_df[['user_id', 'sentence_id']].iloc[0]).all()

# The second assertion fails
assert goal_df[['user_id', 'sentence_id']].iloc[0].isin(source_df[['user_id', 'sentence_id']]).all()

Why does the second assertion fail?

When I print out intermediate values, it looks like even if I replaced the all with any, it would still fail. That is, the isin is saying that the user_id and sentence_id aren't in the source_df at all, despite the line just beforehand proving that they are.

I also thought that maybe it was because there was an indexing issue where the example didn't match the index, as it's required to by isin, however, even if you make source_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]}), the same behavior occurs (first assert passes, second fails.)

What's going on here?

Upvotes: 1

Views: 243

Answers (3)

Laurent B.

Reputation: 2263

Let's consider the second assertion and check its parts:

type(goal_df[['user_id', 'sentence_id']].iloc[0])
<class 'pandas.core.series.Series'>

type(source_df[['user_id', 'sentence_id']])
<class 'pandas.core.frame.DataFrame'>

So in the second expression you are trying to check whether the elements of a Series are in a DataFrame.

The problem is that the Series isin method is not designed for that.

As the documentation states, the isin method for Series only accepts a set or list-like as its values argument:

values : set or list-like

The sequence of values to test.

Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

So the asserted expression evaluates to False, and an AssertionError is raised.
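To see why both values come back False, it helps to look at what a DataFrame yields when it is iterated as a list-like. A minimal sketch, reusing the frames from the question (the names `row`, `labels` and `mask` are introduced here only for illustration): iterating a DataFrame gives its column labels, so the values 1 and 2 appear to be compared against the strings 'user_id' and 'sentence_id' rather than against the cell contents.

```python
import pandas as pd

goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
source_df = pd.DataFrame({'user_id': [1, 1], 'sentence_id': [2, 2]})

row = goal_df[['user_id', 'sentence_id']].iloc[0]  # a Series holding 1 and 2

# Iterating over a DataFrame yields its column labels, not its cell values
labels = list(source_df[['user_id', 'sentence_id']])
print(labels)

# Neither 1 nor 2 is among those labels, so no element is reported as present
mask = row.isin(source_df[['user_id', 'sentence_id']])
print(mask.tolist())
```

This matches what the question observed: replacing all with any would not make the assertion pass either.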

Proposed script

# Proposed second Assertion
src = source_df[['user_id', 'sentence_id']].to_numpy().flatten()  # flatten the DataFrame into a 1-D NumPy array

assert goal_df[['user_id', 'sentence_id']].iloc[0].isin(src).all()
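One caveat with the flattened version: flattening discards the column structure, so the check only says that each value occurs somewhere in source_df, not that the values occur together in one row. A small sketch of that limitation, using a hypothetical `swapped_df` whose values sit in the opposite columns:

```python
import pandas as pd

source_df = pd.DataFrame({'user_id': [1, 1], 'sentence_id': [2, 2]})
# Hypothetical frame: same values as the source rows, but in swapped columns
swapped_df = pd.DataFrame({'user_id': [2], 'sentence_id': [1]})

# Flatten the source into a 1-D array of all its values
src = source_df[['user_id', 'sentence_id']].to_numpy().flatten()

# The check passes even though no row (user_id=2, sentence_id=1) exists in source_df
flat_match = swapped_df[['user_id', 'sentence_id']].iloc[0].isin(src).all()
print(flat_match)
```

If the goal is a true row-wise match, a column-aware comparison is the safer choice.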

Checking row by row

I just read your comment. Try this (Arne's new answer is excellent):

assert (goal_df.isin(source_df).sum(axis=1) == len(source_df.columns)).all()
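Keep in mind that when isin is given a DataFrame, it matches on index and column labels, so this only compares each row of goal_df with the row of source_df that has the same index. It also needs source_df to have a unique index and unique columns. A short sketch of the alignment effect, using a hypothetical `shifted_df` where the matching row sits at index 1 instead of index 0:

```python
import pandas as pd

goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
# Hypothetical source: the row (1, 2) exists, but at index 1 rather than index 0
shifted_df = pd.DataFrame({'user_id': [9, 1], 'sentence_id': [9, 2]})

# Row 0 of goal_df is compared only with row 0 of shifted_df
aligned = goal_df.isin(shifted_df)
found = (aligned.sum(axis=1) == len(shifted_df.columns)).all()
print(found)
```

Here the check reports no match, because the comparison never looks at the row stored under index 1.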

Upvotes: 1

Arne

Reputation: 10545

As Laurent pointed out, isin() is not the right tool here. Instead, you can extend the approach from your first assertion to the full source_df dataframe by using NumPy broadcasting.

Assuming goal_df has just one row and source_df has any number of rows, while both dataframes have the same number of columns, the following assertion checks that the row from goal_df is present as a row somewhere in source_df.

assert (goal_df.values == source_df.values).all(axis=1).any()

This works because the values attribute returns the contents of each dataframe as a NumPy array. When these arrays are compared with ==, the first array is broadcast to match the shape of the second one, meaning its single row is repeated once for each row of the second dataframe. The values are then compared position by position, resulting in a Boolean array with the same shape as the second dataframe.

With all(axis=1) we collapse each row of this array to a single Boolean value, indicating if all the values in that row are True, i.e. if the corresponding row of the second dataframe completely matches the first dataframe. Finally, any() lets the assertion pass if such a match was found for any of the rows.
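The individual steps can be inspected one at a time. This is only an illustrative sketch: the source_df below is a hypothetical variant with one non-matching and one matching row, and the names `elementwise` and `row_matches` are introduced here for readability.

```python
import pandas as pd

goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
# Hypothetical source: first row differs in user_id, second row matches fully
source_df = pd.DataFrame({'user_id': [9, 1], 'sentence_id': [2, 2]})

# Shape (1, 2) compared with shape (2, 2): the single goal row is broadcast
elementwise = goal_df.values == source_df.values
print(elementwise)

# Collapse each row: True only where every column matched
row_matches = elementwise.all(axis=1)
print(row_matches)

# True if at least one source row matched completely
print(row_matches.any())
```

Only the second source row matches in every column, so all(axis=1) marks that row alone and any() then reports a match.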

Upvotes: 2

JarroVGIT

Reputation: 5304

As Laurent explained above, the isin() method is applied to a Series (iloc[int] returns a Series) and checks, for each value in that Series, whether it is in the iterable you provide.

If I understand your question right, you are trying to check if a row from one DF exists in another DF, correct? One way of solving that is to do a merge and count the length of the result:

assert len(pd.merge(goal_df, source_df, how='inner')) > 0
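A quick sketch of how the merge behaves for a present and an absent row. The frame `missing_df` is a hypothetical example added for contrast. With no `on` argument, merge joins on all columns the two frames have in common, which here is both of them.

```python
import pandas as pd

goal_df = pd.DataFrame({'user_id': [1], 'sentence_id': [2]})
source_df = pd.DataFrame({'user_id': [1, 1], 'sentence_id': [2, 2]})
# Hypothetical row that does not occur in source_df
missing_df = pd.DataFrame({'user_id': [1], 'sentence_id': [3]})

# Inner merge keeps only rows whose shared columns match in both frames
matched = pd.merge(goal_df, source_df, how='inner')
unmatched = pd.merge(missing_df, source_df, how='inner')

print(len(matched))    # one result row per matching row in source_df
print(len(unmatched))  # no result rows when nothing matches
```

Because duplicates in source_df each produce a result row, the check compares against 0 rather than expecting exactly one row.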

Upvotes: 0
