NaiveBae

Reputation: 414

In pandas, how is it possible for a Boolean series of len(100) to be applied to a dataframe of len(10) without it throwing an error?

Apologies for not being able to provide the data. Somebody else wrote this code, and I don't understand how it's working.

There's a DataFrame (df) that's, say, 100 rows long. They grouped it:

[EDIT TO QUESTION: I forgot to include that the groupby statement ended with an index reset. Adding that below.]

grouped_df = df.groupby('col_a').sum()['col_b'].sort_values().reset_index()

This resulted in a DataFrame object of length 10.

Then they created a Boolean series to use as a mask. They created it from the original dataframe (df) based on values in a third column:

mask = df['col_c'] > 10

This resulted in a Boolean series of length 100—same length as df, naturally.

Then they applied mask (len=100) to grouped_df (len=10), and the result was a DataFrame object of length 5.

How does that work? What is happening? How can you apply a Boolean series to a dataframe as a mask when the lengths don't match up?

Upvotes: 0

Views: 61

Answers (1)

Timeless

Reputation: 37902

Here is my previous answer to the original question.

Update :

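That's because pandas aligns the boolean mask with the index of grouped_df by label rather than by position. It isn't entirely silent, though: it emits the UserWarning shown in the details further down.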
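A minimal sketch of that alignment on toy data (the names `small` and `long_mask` are made up for illustration and are not from the original code). As far as I can tell, indexing with a mismatched boolean Series behaves like reindexing the mask to the frame's index first, so labels are compared, not positions:

```python
import warnings

import pandas as pd

# Hypothetical toy data: a 3-row frame and a 6-row boolean mask.
small = pd.DataFrame({"x": [10, 20, 30]})                      # index 0..2
long_mask = pd.Series([True, False, True, True, False, True])  # index 0..5

# Implicit alignment: pandas warns, then uses only the mask entries
# whose labels exist in small.index.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    implicit = small[long_mask]

# Explicit equivalent: reindex the mask to the frame's index first.
explicit = small[long_mask.reindex(small.index)]

print(implicit.equals(explicit))
```

Mask entries at labels 3, 4 and 5 have no matching row in `small`, so they are simply dropped.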

Here is a configuration that would lead to a similar scenario :

import numpy as np
import pandas as pd

np.random.seed(20)

df = pd.DataFrame({
    "col_a": np.random.choice(list("ABCDEFGHIJ"), 100),
    "col_b": np.random.randint(0, 20, 100),
    "col_c": np.random.randint(0, 30, 100)
})

grouped_df = (
    df.groupby("col_a").sum()["col_b"]
    .sort_values().reset_index() # length of 10
)

mask = df["col_c"] > 10 # length of 100

out = grouped_df[mask] # length of 5

Output :

print(out)

  col_a  col_b
1     H     52
2     C     70
3     I     70
6     F     87
9     G    190

Intermediates/Details :

#A friendly warning
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
 
>>> grouped_df
  col_a  col_b
0     B     41
1     H     52
2     C     70
3     I     70
4     D     76
5     E     86
6     F     87
7     A    107
8     J    116
9     G    190

>>> grouped_df.index
RangeIndex(start=0, stop=10, step=1)

>>> mask[mask.eq(True)].index
Index([ 1,  2,  3,  6,  9, 10, 13, 14, 15, 17, 18, 19, 20, 23, 24, 25, 26, 27,
       29, 30, 33, 34, 35, 37, 38, 39, 40, 41, 43, 44, 45, 49, 50, 51, 53, 54,
       61, 63, 64, 67, 68, 69, 70, 71, 74, 75, 78, 79, 81, 83, 84, 85, 88, 89,
       90, 91, 92, 93, 94, 96, 97],
      dtype='int64')

>>> grouped_df.index.intersection(mask[mask.eq(True)].index)
Index([1, 2, 3, 6, 9], dtype='int64')
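For contrast, a small sketch of what happens when the mask has no index to align on (toy names again, assumed for illustration). Converting the mask to a NumPy array forces positional matching, and the length mismatch then raises, which is probably the error the question expected:

```python
import pandas as pd

# Hypothetical toy data: 3 rows in the frame, 6 entries in the mask.
frame = pd.DataFrame({"x": [10, 20, 30]})
bool_series = pd.Series([True, False, True, True, False, True])

try:
    # A plain boolean array carries no labels, so pandas cannot align it
    # and checks the length against the frame instead.
    frame[bool_series.to_numpy()]
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```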

Upvotes: 1
