A bugged DataFrame slicing?

Question

This question is a follow-up to a previous one I posted here: "Slicing Pandas Dataframe according to number of lines". I got good answers which solved the problem. Nevertheless, when I apply the solution in a different way, I don't get what I expect and, despite many tests, I don't understand why.

Suppose I have a pandas DataFrame df containing a 'Group' ID (there can of course be many objects in one group) and a quantity, say 'R'. I want to construct another DataFrame containing only the groups with at least 4 objects for which the 4th object, when sorted by R, is less than or equal to R_min (I know it sounds weird to call a maximum 'R_min', but these are galaxy magnitudes, which are negative: the lower, the brighter, or equivalently the higher the absolute value, the brighter). Here is a mock DataFrame constructed for the problem:

import numpy as np
import pandas as pd

df = pd.DataFrame({'R': (-21,-21,-22,-3,-23,-24,-20,-19,-34,-35,-30,-5,-25,-6,-7,-22,-21,-10,-11,-12,-13,-14,-15),
                   'Group': (1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5)})

The solution to my problem is this one, which seems to work perfectly:

R_min = -18.8
df_processed = (df[df.Group.map(df.Group.value_counts().ge(4))]
   .groupby('Group').filter(lambda x: np.any(x.sort_values('R').iloc[3] <= R_min)))

As expected, group 3 is the only one left under my constraint. Now, for verification and to understand how my galaxy group catalogue is structured, I want to check which groups are left over among those having at least four members. I expected code like the following to work in exactly the same way:

df_left = (df[df.Group.map(df.Group.value_counts().ge(4))]
       .groupby('Group').filter(lambda x: np.any(x.sort_values('R').iloc[3] > R_min)))

Unfortunately, it does not. The most striking point is that group 3 is also in df_left! Sorted by R, group 3 gives -35, -34, -30, -19, -5, whose 4th value is -19, which is lower than -18.8. How come? Is the selection method wrong? How should I correct it?

Many thanks

Ted Petrou · Accepted Answer

You are getting this result because x.sort_values('R').iloc[3] is taken from the sorted DataFrame, not from the Series for column R alone, so it returns a whole row containing both R and Group. The comparison with R_min is then applied to every value in that row, and np.any returns True if any of them, including Group, is greater than R_min. Since all Group values are positive, the test is True for every group.
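To make this concrete, here is a minimal sketch that reproduces the behaviour on group 3 of the mock df from the question. The variable names (g3, row, cmp_row, any_row, fourth_R, r_only) are only illustrative and do not appear in the original code.

```python
import numpy as np
import pandas as pd

# Mock data copied from the question.
df = pd.DataFrame({
    'R': (-21, -21, -22, -3, -23, -24, -20, -19, -34, -35, -30, -5,
          -25, -6, -7, -22, -21, -10, -11, -12, -13, -14, -15),
    'Group': (1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
})
R_min = -18.8

g3 = df[df.Group == 3]

# .iloc[3] on the sorted DataFrame returns a whole row:
# a Series indexed by the column names 'R' and 'Group'.
row = g3.sort_values('R').iloc[3]

# The comparison is elementwise, so Group (= 3) is compared with R_min too.
cmp_row = row > R_min
any_row = bool(np.any(cmp_row))

# Selecting column R before taking the 4th value compares a single number.
fourth_R = g3.sort_values('R')['R'].iloc[3]
r_only = bool(fourth_R > R_min)

print(cmp_row)
print(any_row, r_only)
```

Here any_row is True only because of the Group entry, while r_only, which looks at R alone, is False. The `<=` version in the question happens to work because a positive Group value can never be below a negative R_min.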

Your code is also highly suboptimal. You should do this instead:

R_min = -18.8
df.groupby('Group').filter(lambda x: (x.shape[0] >= 4) & (x['R'].nsmallest(4).iloc[-1] <= R_min))
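For the verification step in the question (the groups with at least four members that do not pass the magnitude cut), the same pattern can be reused with the comparison flipped. The following is only a sketch: fourth_lowest_R is a hypothetical helper name, and splitting the size check into its own filter is one possible layout rather than the only way to write it.

```python
import pandas as pd

# Mock data copied from the question.
df = pd.DataFrame({
    'R': (-21, -21, -22, -3, -23, -24, -20, -19, -34, -35, -30, -5,
          -25, -6, -7, -22, -21, -10, -11, -12, -13, -14, -15),
    'Group': (1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
})
R_min = -18.8


def fourth_lowest_R(group):
    """Return the 4th-lowest R of a group (hypothetical helper; assumes >= 4 rows)."""
    return group['R'].nsmallest(4).iloc[-1]


# Keep only groups with at least four members.
big = df.groupby('Group').filter(lambda x: len(x) >= 4)

# Split those groups on the 4th-lowest R, using column R only.
df_processed = big.groupby('Group').filter(lambda x: fourth_lowest_R(x) <= R_min)
df_left = big.groupby('Group').filter(lambda x: fourth_lowest_R(x) > R_min)

kept = sorted(df_processed['Group'].unique().tolist())
left = sorted(df_left['Group'].unique().tolist())
print(kept, left)
```

Because both filters look at column R only, every group with at least four members lands in exactly one of df_processed and df_left, so group 3 no longer appears in both.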
