Reputation: 17854
How to get the most frequent row in a DataFrame? For example, if I have the following table:
col_1 col_2 col_3
0 1 1 A
1 1 0 A
2 0 1 A
3 1 1 A
4 1 0 B
5 1 0 C
Expected result:
col_1 col_2 col_3
0 1 1 A
EDIT: I need the most frequent row (as one unit), not the most frequent value in each column, which can be calculated with the mode() method.
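To make the distinction concrete, here is a minimal sketch that rebuilds the table above and shows what mode() returns for it. The variable name column_modes is only illustrative. Because col_2 is tied between 0 and 1, mode() returns two rows, and neither one is computed from whole rows of the input.

```python
import pandas as pd

# Rebuild the example table from the question.
df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# mode() is computed independently for each column, so a row of its
# output is not guaranteed to be the most frequent row of the input.
column_modes = df.mode()
print(column_modes)
```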
Upvotes: 15
Views: 729
Reputation: 17854
Since pandas 1.1.0, it is possible to use the value_counts() method to count unique rows in a DataFrame:
df.value_counts()
Output:
col_1 col_2 col_3
1 1 A 2
0 C 1
B 1
A 1
0 1 A 1
This method can be used to find the most frequent row:
df.value_counts().head(1).index.to_frame(index=False)
Output:
col_1 col_2 col_3
0 1 1 A
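For reference, a self-contained version of the approach above, assuming pandas 1.1.0 or later and the sample data from the question. The names counts, most_frequent, top_count and tied are illustrative. The last step is an optional addition that keeps every row sharing the highest count, in case of ties.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# value_counts() sorts by count in descending order by default,
# so the first entry is the most frequent row.
counts = df.value_counts()
most_frequent = counts.head(1).index.to_frame(index=False)
top_count = int(counts.iloc[0])

# Optional: keep all rows that share the highest count.
tied = counts[counts == counts.max()].index.to_frame(index=False)

print(most_frequent)
print(top_count)
```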
Upvotes: 2
Reputation: 450
You can do this with groupby and size:
counts = df.groupby(df.columns.tolist(), as_index=False).size()
result = counts.loc[[counts["size"].idxmax()]].drop(columns="size")
result.reset_index(drop=True)  # this is just to reset the index
Upvotes: 4
Reputation: 5949
The numpy_indexed library (commonly imported as npi) helps with 'groupby'-type problems using less code, with performance similar to NumPy. This is an alternative that is quite similar to @Divakar's np.unique()-based solution:
import numpy_indexed as npi

arr = df.values.astype(str)
counts = npi.multiplicity(arr)  # number of occurrences of each row
output = df.iloc[[counts.argmax()]]
Upvotes: 3
Reputation: 221614
With NumPy's np.unique:
In [92]: u,idx,c = np.unique(df.values.astype(str), axis=0, return_index=True, return_counts=True)
In [99]: df.iloc[[idx[c.argmax()]]]
Out[99]:
col_1 col_2 col_3
0 1 1 A
If you are looking for performance, convert the string column to numeric and then use np.unique:
a = np.c_[df.col_1, df.col_2, pd.factorize(df.col_3)[0]]
u,idx,c = np.unique(a, axis=0, return_index=True, return_counts=True)
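The snippet above stops at the counts. A minimal end-to-end sketch, assuming the sample data from the question, adds the same final lookup as in the string-based version. The names codes and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# Encode the string column as integer codes so the whole array is numeric.
codes = pd.factorize(df["col_3"])[0]
a = np.c_[df["col_1"].to_numpy(), df["col_2"].to_numpy(), codes]

# idx holds the first position of each unique row, c holds its count.
u, idx, c = np.unique(a, axis=0, return_index=True, return_counts=True)

# Look up the original row at the first occurrence of the most frequent one.
result = df.iloc[[idx[c.argmax()]]]
print(result)
```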
Upvotes: 10
Reputation: 323316
Check groupby:
df.groupby(df.columns.tolist()).size().sort_values().tail(1).reset_index().drop(columns=0)
col_1 col_2 col_3
0 1 1 A
Upvotes: 12