Reputation: 67
I have created a data frame as follows:
np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)
for v in lvl0:
df.loc[(v, 'mode'), :] = np.nan
df.sort_index(inplace=True)
Which gives
c1 c2
A mode NaN NaN
x 0 1
y 1 0
z 1 1
B mode NaN NaN
x 1 1
y 1 1
z 1 0
I then calculate the modes for group A and B for both columns as follows
for c in cols:
modes = df.groupby(level=0)[c].agg(pd.Series.mode)
df.loc[(lvl0, 'default'), c] = modes.values
which results in
c1 c2
A mode 1 1
x 0 1
y 1 0
z 1 1
B mode 1 1
x 1 1
y 1 1
z 1 0
I now want to remove all rows where the values are equal to the mode of the corresponding first level group (A-B), which in this case should result in
c1 c2
A mode 1 1
x 0 1
y 1 0
B mode 1 1
z 1 0
I could achieve the desired result by looping over lvl0
, but I was wondering if there was a more elegant solution, perhaps using groupby
.
Additionally, I wonder if there is a way to add the mode rows when the modes are calculated, as opposed to adding empty (NaN) rows beforehand. If I don't add the empty rows beforehand, the line df.loc[(lvl0, 'default'), c] = modes.values
gives me a KeyError on 'default'.
Upvotes: 2
Views: 200
Reputation: 863186
Use:
np.random.seed(0)
lvl0 = ['A','B']
lvl1 = ['x', 'y', 'z']
idx = pd.MultiIndex.from_product([lvl0, lvl1])
cols = ['c1', 'c2']
df = pd.DataFrame(index=idx, columns=cols)
df.loc[:] = np.random.randint(0,2, size=df.shape)
df.sort_index(inplace=True)
#get first modes per groups
modes = df.groupby(level=0).agg(lambda x: x.mode().iat[0])
#compare modes per groups with removed second level and append modes df
df = (pd.concat([modes.assign(tmp='mode').set_index('tmp', append=True), \
df[df.droplevel(1).ne(modes).any(axis=1).to_numpy()]])
.sort_index())
print (df)
c1 c2
tmp
A mode 1 1
x 0 1
y 1 0
B mode 1 1
z 1 0
Upvotes: 1