MDescamps
MDescamps

Reputation: 67

Pandas multiindex: drop rows with group-specific condition

I have created a data frame as follows:

np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)

for v in lvl0:
    df.loc[(v, 'mode'), :] = np.nan
df.sort_index(inplace=True)

Which gives

         c1   c2
A mode  NaN  NaN
  x       0    1
  y       1    0
  z       1    1
B mode  NaN  NaN
  x       1    1
  y       1    1
  z       1    0

I then calculate the modes for group A and B for both columns as follows

for c in cols:
    modes = df.groupby(level=0)[c].agg(pd.Series.mode)
    df.loc[(lvl0, 'default'), c] = modes.values

which results in

       c1 c2
A mode  1  1
  x     0  1
  y     1  0
  z     1  1
B mode  1  1
  x     1  1
  y     1  1
  z     1  0

I now want to remove all rows where the values are equal to the mode of the corresponding first level group (A-B), which in this case should result in

       c1 c2
A mode  1  1
  x     0  1
  y     1  0
B mode  1  1
  z     1  0

I could achieve the desired result by looping over lvl0, but I was wondering if there was a more elegant solution, perhaps using groupby.

Additionally, I wonder if there is a way to add the mode rows when the modes are calculated, as opposed to adding empty (NaN) rows beforehand. If I don't add the empty rows beforehand, the line df.loc[(lvl0, 'default'), c] = modes.values gives me a KeyError on 'default'.

Upvotes: 2

Views: 200

Answers (1)

jezrael
jezrael

Reputation: 863186

Use:

np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)

df.sort_index(inplace=True)

#get first modes per groups
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])
    
#compare modes per groups with removed second level and append modes df
df = (pd.concat([modes.assign(tmp='mode').set_index('tmp', append=True), \
                df[df.droplevel(1).ne(modes).any(axis=1).to_numpy()]])
        .sort_index())
print (df)
        c1  c2
  tmp         
A mode   1   1
  x      0   1
  y      1   0
B mode   1   1
  z      1   0

Upvotes: 1

Related Questions