Reputation: 43
I have a pandas DataFrame with batch numbers and I want to exclude entries that appear fewer than x times. Let us say x is 3 and the DataFrame looks like this:
BatchName
0 001
1 001
2 001
3 002
4 002
5 003
6 003
7 003
As far as I know, I cannot use Boolean indexing straight up because the column contains no information about the other columns. I therefore created a count dict using
Countdict = dict(df.groupby('BatchName')['BatchPosition'].count())
Where 'BatchPosition' is another column uninteresting to this problem.
I then try to do Boolean indexing with the dataframe like:
df[Countdict[df['BatchName']] >= 3]
This, however, does not work as I hoped because it tries to put the whole Series into the dict. How do I make it put each value from the Series into the dict instead?
Upvotes: 1
Views: 372
Reputation: 862671
The problem is that the aggregated Series returned by df.groupby('BatchName')['BatchPosition'].count() has a different size from the original df (one row per group instead of one row per entry). You can use Series.map with Series.value_counts to build a mask with the same size as the original df:
df[df['BatchName'].map(df['BatchName'].value_counts()) >= 3]
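Series.map also accepts a plain dict, so it does exactly what the question asks for: it looks up each value of the Series individually. Below is a minimal sketch, assuming the BatchName values are strings as in the sample data; the names count_dict, per_row_counts and filtered are only illustrative.

```python
import pandas as pd

# Sample data reconstructed from the question (assumption: values are strings).
df = pd.DataFrame(
    {"BatchName": ["001", "001", "001", "002", "002", "003", "003", "003"]}
)

# A plain dict of occurrences per batch, similar in spirit to Countdict.
count_dict = df["BatchName"].value_counts().to_dict()

# map() looks up every value on its own, so the result has one count per row
# and shares the index of df.
per_row_counts = df["BatchName"].map(count_dict)

# The Boolean mask now has the same length as df and can be used for indexing.
filtered = df[per_row_counts >= 3]
print(filtered)
```

Passing the value_counts Series directly to map, as in the line above, gives the same mask without building the intermediate dict.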
Your groupby solution can be fixed by using GroupBy.transform, which returns a Series with the same size as the original df:
df[df.groupby('BatchName')['BatchName'].transform('count') >= 3]
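As a self-contained sketch of the transform approach with the threshold kept in a variable: the BatchPosition values below are made up for illustration (the question does not show them), and x, group_counts and kept are hypothetical names.

```python
import pandas as pd

# Sample data modeled on the question; BatchPosition values are invented.
df = pd.DataFrame(
    {
        "BatchName": ["001", "001", "001", "002", "002", "003", "003", "003"],
        "BatchPosition": [1, 2, 3, 1, 2, 1, 2, 3],
    }
)

x = 3  # minimum number of occurrences a batch needs in order to be kept

# transform('count') broadcasts each group's count back to every row of that
# group, so the result has the same length and index as df.
# Note: 'count' ignores missing values in the selected column.
group_counts = df.groupby("BatchName")["BatchName"].transform("count")

kept = df[group_counts >= x]
print(kept)
```

Here batch 002 appears only twice, so its two rows are dropped and the six rows of batches 001 and 003 remain.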
Details:
print (df['BatchName'].map(df['BatchName'].value_counts()))
0 3
1 3
2 3
3 2
4 2
5 3
6 3
7 3
Name: BatchName, dtype: int64
print (df['BatchName'].map(df['BatchName'].value_counts()) >= 3)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
Name: BatchName, dtype: bool
print (df.groupby('BatchName')['BatchName'].transform('count'))
0 3
1 3
2 3
3 2
4 2
5 3
6 3
7 3
Name: BatchName, dtype: int64
print (df.groupby('BatchName')['BatchName'].transform('count') >= 3)
0 True
1 True
2 True
3 False
4 False
5 True
6 True
7 True
Name: BatchName, dtype: bool
Upvotes: 1