Reputation: 414
Apologies for not being able to provide the data. Somebody else wrote this code, and I don't understand how it's working.
There's a dataframe (df) that's, say, 100 samples long. They grouped it:
[EDIT TO QUESTION: I forgot to include that the groupby statement ended with an index reset. Adding that below.]
grouped_df = df.groupby('col_a').sum()['col_b'].sort_values().reset_index()
This resulted in a DataFrame object of length 10.
Then they created a Boolean series to use as a mask. They created it from the original dataframe (df) based on values in a third column:
mask = df['col_c'] > 10
This resulted in a Boolean series of length 100, the same length as df, naturally.
Then they applied mask (len=100) to grouped_df (len=10), and the result was a DataFrame object of length 5.
How does that work? What is happening? How can you apply a Boolean series to a dataframe as a mask when the lengths don't match up?
Upvotes: 0
Views: 61
Reputation: 37902
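Update:
That's because pandas aligns the boolean mask with the index of grouped_df. The mask is reindexed to the labels of grouped_df (0 to 9 after the index reset), a UserWarning is emitted, and only the rows whose aligned mask value is True are kept.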
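To make the alignment visible on a small scale, here is a minimal sketch. The frame `small` and the series `long_mask` are made-up toy data, not from the question. It assumes that indexing with a longer boolean Series gives the same rows as reindexing that Series to the frame's index first, which is what the warning message describes.

```python
import warnings

import pandas as pd

# Hypothetical toy data, only for illustration.
small = pd.DataFrame({"x": [10, 20, 30]})                 # index 0, 1, 2
long_mask = pd.Series([True, False, True, True, False])   # index 0..4

# Implicit alignment: pandas warns, then looks up labels 0, 1, 2 in long_mask.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    implicit = small[long_mask]

# Explicit version of the same step: reindex the mask, then filter.
explicit = small[long_mask.reindex(small.index)]

print(implicit)
print(implicit.equals(explicit))
```

The mask entries at labels 3 and 4 have no matching row in `small`, so they are simply ignored.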
Here is a configuration that leads to a similar scenario:
import numpy as np
import pandas as pd

np.random.seed(20)
df = pd.DataFrame({
    "col_a": np.random.choice(list("ABCDEFGHIJ"), 100),
    "col_b": np.random.randint(0, 20, 100),
    "col_c": np.random.randint(0, 30, 100)
})
grouped_df = (
    df.groupby("col_a").sum()["col_b"]
    .sort_values().reset_index()  # length of 10
)
mask = df["col_c"] > 10  # length of 100
out = grouped_df[mask]  # length of 5
Output:
print(out)
  col_a  col_b
1     H     52
2     C     70
3     I     70
6     F     87
9     G    190
Intermediates/Details:
# A friendly warning
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
>>> grouped_df
  col_a  col_b
0     B     41
1     H     52
2     C     70
3     I     70
4     D     76
5     E     86
6     F     87
7     A    107
8     J    116
9     G    190
>>> grouped_df.index
RangeIndex(start=0, stop=10, step=1)
>>> mask[mask.eq(True)].index
Index([ 1,  2,  3,  6,  9, 10, 13, 14, 15, 17, 18, 19, 20, 23, 24, 25, 26, 27,
       29, 30, 33, 34, 35, 37, 38, 39, 40, 41, 43, 44, 45, 49, 50, 51, 53, 54,
       61, 63, 64, 67, 68, 69, 70, 71, 74, 75, 78, 79, 81, 83, 84, 85, 88, 89,
       90, 91, 92, 93, 94, 96, 97],
      dtype='int64')
>>> grouped_df.index.intersection(mask[mask.eq(True)].index)
Index([1, 2, 3, 6, 9], dtype='int64')
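The index reset mentioned in the question's edit is what makes this alignment possible. The sketch below uses hypothetical toy data (`toy`, `toy_mask`, `by_label` and `by_position` are names invented here) and assumes a reasonably recent pandas. Without the reset, the grouped frame is indexed by the group labels, none of which exist in the mask's index, so pandas should raise an error instead of filtering.

```python
import warnings

import pandas as pd

# Hypothetical toy data (not from the question).
toy = pd.DataFrame({
    "col_a": list("AABBC"),
    "col_b": [1, 2, 3, 4, 5],
    "col_c": [5, 20, 30, 1, 50],
})
toy_mask = toy["col_c"] > 10  # index 0..4, one entry per row of toy

# Grouped sums keep the group labels ("A", "C", "B") as the index.
by_label = toy.groupby("col_a")["col_b"].sum().sort_values().to_frame()
by_position = by_label.reset_index()  # index becomes 0, 1, 2

error_name = None
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    try:
        by_label[toy_mask]  # labels "A", "C", "B" are not in the mask's index
    except Exception as exc:  # expected to be pandas' IndexingError
        error_name = type(exc).__name__
    kept = by_position[toy_mask]  # labels 0, 1, 2 are found in the mask

print(error_name)
print(kept)
```

Note that the rows kept from `by_position` depend only on whether rows 0, 1 and 2 of the original frame passed the `col_c` test. They say nothing about the groups themselves, so a filter like this is most likely a bug rather than an intended selection.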
Upvotes: 1