Reputation: 813
This question is a follow-up to a previous one I posted here: Slicing Pandas Dataframe according to number of lines. I got good answers that solved the problem. Nevertheless, when I try the solution a different way, I don't get what I expect and, despite many tests, I don't understand why.
Suppose I have a pandas DataFrame df containing a 'Group' id (there can of course be many objects in one group) and a quantity, say 'R'. I want to construct another DataFrame containing only the groups of at least 4 objects for which the 4th object, when sorted by R, is lower than R_min (I know it sounds odd to call a maximum 'R_min', but these are galaxy magnitudes, which are negative: the lower the value, the brighter the galaxy, or equivalently the higher the absolute value, the brighter). Here is a mock DataFrame constructed for the problem:
import numpy as np
import pandas as pd
df = pd.DataFrame({'R': (-21,-21,-22,-3,-23,-24,-20,-19,-34,-35,-30,-5,-25,-6,-7,-22,-21,-10,-11,-12,-13,-14,-15),
                   'Group': (1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5,5)})
The solution to my problem is this one, which seems to work perfectly:
R_min = -18.8
df_processed = (df[df.Group.map(df.Group.value_counts().ge(4))]
.groupby('Group').filter(lambda x: np.any(x.sort_values('R').iloc[3] <= R_min)))
As expected, group 3 is the only one left under my constraint. Now, to verify this and to see how my galaxy group catalogue is structured, I want to check which groups remain among those having at least four members. I expect code like the following to work in exactly the same way:
df_left = (df[df.Group.map(df.Group.value_counts().ge(4))]
.groupby('Group').filter(lambda x: np.any(x.sort_values('R').iloc[3] > R_min)))
Unfortunately, it does not:
The most striking point is that group 3 is also in df_left! Sorted by R, group 3 gives -35, -34, -30, -19, -5, whose 4th value is -19, which is lower than -18.8. How come? Is the selection method wrong? How should I correct it?
Many thanks
Upvotes: 2
Views: 72
Reputation: 61967
You get this result because x.sort_values('R').iloc[3] sorts the whole DataFrame, not just the Series for column R, so it returns a row containing both R and Group. This means that np.any checks every column of that row, including Group, against R_min, and since all Group values are positive (and therefore greater than R_min), the > R_min test always returns True. Your first filter happened to work only because Group is never <= R_min, so there the result depends on R alone.
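To make this concrete, here is a small sketch using the mock DataFrame from the question and looking only at group 3. The variable names (group3, fourth_row, any_result, fourth_R) are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

# Mock data copied from the question.
df = pd.DataFrame({
    'R': (-21, -21, -22, -3, -23, -24, -20, -19, -34, -35, -30, -5,
          -25, -6, -7, -22, -21, -10, -11, -12, -13, -14, -15),
    'Group': (1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
})
R_min = -18.8

group3 = df[df['Group'] == 3]

# iloc[3] on a sorted DataFrame returns a whole row: a Series indexed by
# the column names, so it holds both R and Group.
fourth_row = group3.sort_values('R').iloc[3]
print(fourth_row)

# The comparison is applied to both entries; the Group entry (3) is
# greater than R_min, so np.any is True regardless of R.
any_result = bool(np.any(fourth_row > R_min))
print(any_result)

# Selecting column R first yields a scalar, and the test behaves as intended.
fourth_R = group3['R'].sort_values().iloc[3]
print(fourth_R, fourth_R > R_min)
```

Restricting the lambda to x['R'] (or x.sort_values('R')['R'].iloc[3]) should therefore be enough to fix the df_left filter in the question.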
Your code is also highly suboptimal. You should do this instead:
R_min = -18.8
df.groupby('Group').filter(lambda x: (x.shape[0] >= 4) & (x['R'].nsmallest(4).iloc[-1] <= R_min))
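Since the question also asks for the complementary set (groups of at least four members that fail the cut), one possible approach, sketched here as an assumption rather than as part of the answer above, is to compute the 4th-smallest R once per group and split on it. The names big, fourth and passing are hypothetical.

```python
import pandas as pd

# Mock data copied from the question.
df = pd.DataFrame({
    'R': (-21, -21, -22, -3, -23, -24, -20, -19, -34, -35, -30, -5,
          -25, -6, -7, -22, -21, -10, -11, -12, -13, -14, -15),
    'Group': (1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3,
              4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
})
R_min = -18.8

# Keep only groups with at least four members.
big = df.groupby('Group').filter(lambda g: len(g) >= 4)

# 4th-smallest R of each remaining group, as a Series indexed by Group.
fourth = big.groupby('Group')['R'].apply(lambda s: s.nsmallest(4).iloc[-1])

# Group ids whose 4th-smallest R satisfies the cut.
passing = fourth[fourth <= R_min].index

df_processed = big[big['Group'].isin(passing)]   # groups passing the cut
df_left = big[~big['Group'].isin(passing)]       # groups of >= 4 that fail it

print(sorted(df_processed['Group'].unique().tolist()))
print(sorted(df_left['Group'].unique().tolist()))
```

Because both frames are derived from the same boolean split, a group cannot end up in both of them, which is the property the original pair of filters was missing.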
Upvotes: 1