Boolean indexing with dict lookup

Question

I have a pandas DataFrame with batch numbers and I want to exclude entries that appear fewer than x times. Let us say x is 3 and the dataframe looks like this:

As far as I know, I cannot use Boolean indexing straight up because the column contains no information about the other columns. I therefore created a count dict using

Countdict = dict(df.groupby('BatchName')['BatchPosition'].count())

Where 'BatchPosition' is another column uninteresting to this problem.

I then try to do Boolean indexing with the dataframe like:

df[Countdict[df['BatchName']] >= 3]

This, however, does not work as I hoped, because it tries to use the whole Series as a single dict key. How do I make it look up each value from the Series in the dict instead?

jezrael · Accepted Answer

The problem is that the aggregated Series returned by df.groupby('BatchName')['BatchPosition'].count() has a different size than the original df: it holds one entry per batch rather than one per row. To get a mask with the same size as the original df, you can use Series.map with Series.value_counts:

df[df['BatchName'].map(df['BatchName'].value_counts()) >= 3]

Your groupby solution can be adapted with GroupBy.transform, which returns the count for each row of the original df:

df[df.groupby('BatchName')['BatchName'].transform('count') >= 3]
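Both variants can be tried with a small self-contained sketch. The frame here is an assumption: the batch names and BatchPosition values are invented, and only the group sizes (3, 2, 3) are chosen to match the counts printed under Details.

```python
import pandas as pd

# Hypothetical data: names and positions are invented for illustration;
# only the group sizes (3, 2, 3) mirror the counts shown under Details.
df = pd.DataFrame({
    'BatchName': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'BatchPosition': [1, 2, 3, 1, 2, 1, 2, 3],
})

x = 3  # minimum number of occurrences a batch needs to be kept

# Variant 1: look up each row's batch name in the value_counts Series
mask_map = df['BatchName'].map(df['BatchName'].value_counts()) >= x

# Variant 2: broadcast the per-group count back onto every row
mask_transform = df.groupby('BatchName')['BatchName'].transform('count') >= x

# Either mask has one entry per row, so it can index df directly
filtered = df[mask_map]
```

On this data the two masks are identical, so which one to use is mostly a matter of taste.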

Details:

print (df['BatchName'].map(df['BatchName'].value_counts()))
0    3
1    3
2    3
3    2
4    2
5    3
6    3
7    3
Name: BatchName, dtype: int64

print (df['BatchName'].map(df['BatchName'].value_counts()) >= 3)
0     True
1     True
2     True
3    False
4    False
5     True
6     True
7     True
Name: BatchName, dtype: bool

print (df.groupby('BatchName')['BatchName'].transform('count'))
0    3
1    3
2    3
3    2
4    2
5    3
6    3
7    3
Name: BatchName, dtype: int64

print (df.groupby('BatchName')['BatchName'].transform('count') >= 3)
0     True
1     True
2     True
3    False
4    False
5     True
6     True
7     True
Name: BatchName, dtype: bool
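If you would rather keep the Countdict from the question, Series.map also accepts a dict and looks up each value separately. This is a rough sketch rather than a tested recipe for your data; the frame is made up for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    'BatchName': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'BatchPosition': [1, 2, 3, 1, 2, 1, 2, 3],
})

# The dict from the question: batch name -> number of rows
Countdict = dict(df.groupby('BatchName')['BatchPosition'].count())

# Indexing the dict with the whole Series fails: a Series is not
# hashable, so it cannot serve as a single dict key
try:
    Countdict[df['BatchName']]
    lookup_failed = False
except TypeError:
    lookup_failed = True

# Series.map performs the dict lookup once per value instead
counts_per_row = df['BatchName'].map(Countdict)
result = df[counts_per_row >= 3]
```

Note that count() skips missing values in BatchPosition, so if that column can contain NaN, the dict may report smaller numbers than value_counts on BatchName.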
