Pandas group by column unless one of the values in the group is a certain value

Question

I have a df like this:

X  y
a  0
a  0
a  0
a  1
b  1
b  1
a  2
b  2
c  2
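
For reproducibility, here is a minimal sketch that builds this frame. The column names and values are taken from the table above, and the default integer index is an assumption:

```python
import pandas as pd

# Same nine rows as the table above: X is the label, y is the grouping key.
df = pd.DataFrame({
    "X": ["a", "a", "a", "a", "b", "b", "a", "b", "c"],
    "y": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})
```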

I want to group df by df.y. In this grouping, I want to aggregate df.X in a specific way:

If 'b' or 'c' makes up >= 25% of the rows in a group, assign whichever of 'b' or 'c' is more frequent to the group (or both, if they are tied).
Otherwise, assign 'a' to the group.

It should look like this:

X      y
a      0
b      1
[b,c]  2

Edit

When I run:

mask1 = s[['b','c']].ge(0.25).any(1)

s1 = np.where(s['b']==s['c'], {'b','c'}, s[['b','c']].idxmax(1))

pd.Series(np.where(mask1, s1, 'a'), index=s.index)

I get this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
 in 
----> 1 mask1 = s[['b','c']].ge(0.25).any(1)
      2 
      3 s1 = np.where(s['b']==s['c'], {'b','c'}, s[['b','c']].idxmax(1))
      4 
      5 pd.Series(np.where(mask1, s1, 'a'), index=s.index)

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2903             if is_iterator(key):
   2904                 key = list(key)
-> 2905             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   2906 
   2907         # take() does not accept boolean indexers

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1252             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1253 
-> 1254         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1255         return keyarr, indexer
   1256 

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1296             if missing == len(indexer):
   1297                 axis_name = self.obj._get_axis_name(axis)
-> 1298                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1299 
   1300             # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Index(['b', 'c'], dtype='object')] are in the [columns]"

Quang Hoang · Accepted Answer

It's pretty straightforward from your logic:

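import numpy as np
import pandas as pd

# Share of each X value within each y group, one column per X value
s = (df.groupby('y')['X'].value_counts(normalize=True)
       .unstack('X', fill_value=0)
    )

# True where 'b' or 'c' reaches at least 25% of the group
mask1 = s[['b','c']].ge(0.25).any(axis=1)

# Both labels on a tie, otherwise the more frequent of the two
s1 = np.where(s['b']==s['c'], {'b','c'}, s[['b','c']].idxmax(axis=1))

pd.Series(np.where(mask1, s1, 'a'), index=s.index)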

Output:

y
0         a
1         b
2    {c, b}
dtype: object
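
The KeyError quoted in the question says that neither 'b' nor 'c' is a column of `s`. That can happen, for example, if `s` was not built with the `unstack` step, or if the real data contains neither value. Below is a rough sketch of a more defensive variant, not a definitive implementation. It reindexes the columns so that 'b' and 'c' always exist, and returns a list for ties to match the expected output in the question. The function name `assign_labels` and its parameters `candidates`, `default` and `threshold` are hypothetical names chosen for illustration:

```python
import pandas as pd

def assign_labels(df, candidates=("b", "c"), default="a", threshold=0.25):
    """Hypothetical helper: one label (or a list of tied labels) per y group."""
    # Share of each X value within each y group; reindex guarantees that
    # every candidate column exists, filled with 0 when it never occurs.
    shares = (df.groupby("y")["X"].value_counts(normalize=True)
                .unstack("X", fill_value=0)
                .reindex(columns=list(candidates), fill_value=0))
    top = shares.max(axis=1)  # largest candidate share per group

    labels = []
    for key, row in shares.iterrows():
        if top.loc[key] < threshold:
            # No candidate reaches the threshold: fall back to the default.
            labels.append(default)
        else:
            # Keep every candidate that ties for the largest share.
            winners = [c for c in candidates if row[c] == top.loc[key]]
            labels.append(winners[0] if len(winners) == 1 else winners)

    return pd.Series(labels, index=shares.index, dtype=object, name="X")

df = pd.DataFrame({
    "X": ["a", "a", "a", "a", "b", "b", "a", "b", "c"],
    "y": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})
result = assign_labels(df)
```

With the sample frame, `result` should hold 'a' for group 0, 'b' for group 1, and the list ['b', 'c'] for group 2. The row-wise loop trades some speed for readability, so for large frames the vectorized `np.where` version above is likely preferable once the columns are guaranteed to exist.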
