connor449

Reputation: 1679

Pandas group by column unless one of the values in the group is a certain value

I have a df like this:

X  y
a  0
a  0
a  0
a  1
b  1
b  1
a  2
b  2
c  2
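For reproducibility, the frame above can be rebuilt with the snippet below. It is only a sketch: the column names are taken from the table, the integer dtype of `y` is an assumption, and the `shares` name is introduced purely for illustration. It also prints the share of each `X` value within each `y` group, which is the quantity the aggregation depends on.

```python
import pandas as pd

# Rebuild the example frame shown above (y assumed to be integer-typed)
df = pd.DataFrame({
    "X": ["a", "a", "a", "a", "b", "b", "a", "b", "c"],
    "y": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Relative frequency of each X value within each y group
shares = df.groupby("y")["X"].value_counts(normalize=True)
print(shares)
```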

I want to group df by df.y and aggregate df.X within each group in a specific way. The result should look like this:

X      y
a      0
b      1
[b,c]  2

Edit

When I run:

mask1 = s[['b','c']].ge(0.25).any(1)

s1 = np.where(s['b']==s['c'], {'b','c'}, s[['b','c']].idxmax(1))

pd.Series(np.where(mask1, s1, 'a'), index=s.index)

I get this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-162-c6e6610b3c30> in <module>
----> 1 mask1 = s[['b','c']].ge(0.25).any(1)
      2 
      3 s1 = np.where(s['b']==s['c'], {'b','c'}, s[['b','c']].idxmax(1))
      4 
      5 pd.Series(np.where(mask1, s1, 'a'), index=s.index)

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2903             if is_iterator(key):
   2904                 key = list(key)
-> 2905             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   2906 
   2907         # take() does not accept boolean indexers

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1252             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1253 
-> 1254         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1255         return keyarr, indexer
   1256 

~/anaconda3/envs/cv2/lib/python3.6/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1296             if missing == len(indexer):
   1297                 axis_name = self.obj._get_axis_name(axis)
-> 1298                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1299 
   1300             # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Index(['b', 'c'], dtype='object')] are in the [columns]"

Upvotes: 0

Views: 50

Answers (1)

Quang Hoang

Reputation: 150785

It's pretty straightforward from your logic:

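import numpy as np
import pandas as pd

# Share of each X value within each y group, one column per X value
s = (df.groupby('y')['X'].value_counts(normalize=True)
       .unstack('X', fill_value=0)
    )

# True where 'b' or 'c' makes up at least 25% of the group
mask1 = s[['b','c']].ge(0.25).any(axis=1)

# On a tie keep both labels, otherwise the more frequent of 'b' and 'c'
s1 = np.where(s['b']==s['c'], {'b','c'}, s[['b','c']].idxmax(axis=1))

# Fall back to 'a' when neither 'b' nor 'c' reaches the threshold
pd.Series(np.where(mask1, s1, 'a'), index=s.index)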

Output:

y
0         a
1         b
2    {c, b}
dtype: object
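For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The function name `pick_label` and the `threshold` parameter are hypothetical names used only for illustration. The `reindex` step is an addition: it assumes the `KeyError` in the question's edit comes from `s` having no `b` or `c` columns (for example, if `s` was built without the `unstack` step, or the real data uses other labels). That cause is a guess, since the edit does not show how `s` was created.

```python
import pandas as pd

# Example frame from the question
df = pd.DataFrame({
    "X": ["a", "a", "a", "a", "b", "b", "a", "b", "c"],
    "y": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})


def pick_label(frame, threshold=0.25):
    """Pick one label (or a set of tied labels) per y group."""
    # Share of each X value within each y group, one column per X value
    s = (frame.groupby("y")["X"].value_counts(normalize=True)
              .unstack("X", fill_value=0))

    # Guarantee that 'b' and 'c' exist as columns, filling absent ones with 0
    bc = s.reindex(columns=["b", "c"], fill_value=0)

    mask = bc.ge(threshold).any(axis=1)   # 'b' or 'c' reaches the threshold
    tie = bc["b"].eq(bc["c"])             # 'b' and 'c' are equally frequent
    best = bc.idxmax(axis=1)              # more frequent of 'b' and 'c'

    labels = [
        "a" if not m else ({"b", "c"} if t else w)
        for m, t, w in zip(mask, tie, best)
    ]
    return pd.Series(labels, index=s.index, dtype=object)


result = pick_label(df)
print(result)
```

A plain list comprehension is used here instead of `np.where` so that the tie case, which stores a Python set inside the Series, stays easy to follow. With the guard in place, a frame that contains only `a` values returns `a` for every group instead of raising.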

Upvotes: 1
