Reputation: 12935
I have a DataFrame like this one:
Communications and Search Business General Lifestyle
0 0.745763 0.050847 0.118644 0.084746
0 0.333333 0.000000 0.583333 0.083333
0 0.617021 0.042553 0.297872 0.042553
0 0.435897 0.000000 0.410256 0.153846
0 0.358974 0.076923 0.410256 0.153846
I want to create a new column comprised of the column labels of each row’s maximum value. The desired output is like this:
Communications and Search Business General Lifestyle Max
0 0.745763 0.050847 0.118644 0.084746 Communications
0 0.333333 0.000000 0.583333 0.083333 Business
0 0.617021 0.042553 0.297872 0.042553 Communications
0 0.435897 0.000000 0.410256 0.153846 Communications
0 0.358974 0.076923 0.410256 0.153846 Business
Upvotes: 215
Views: 190216
Reputation: 23449
Another solution is to flag the position of the maximum values of each row and get the corresponding column names. In particular, this solution works well if multiple columns contain the maximum value for some rows and you want to return all column names with the maximum value for each row:1
Code:
# look for the max values in each row
mxs = df.eq(df.max(axis=1), axis=0)
# join the column names of the max values of each row into a single string
df['Max'] = mxs.dot(mxs.columns + ', ').str.rstrip(', ')
A slight variation: If you want to pick one column randomly when multiple columns contain the maximum value:
Code:
mxs = df.eq(df.max(axis=1), axis=0)
df['Max'] = mxs.where(mxs).stack().groupby(level=0).sample(n=1).index.get_level_values(1)
You can also do this for specific columns by selecting the columns:
# for column names of max value of each row
cols = ['Communications', 'Search', 'Business']
mxs = df[cols].eq(df[cols].max(axis=1), axis=0)
df['max among cols'] = mxs.dot(mxs.columns + ', ').str.rstrip(', ')
1: idxmax(1)
returns only the first column name with the max value if the max value is the same for multiple columns, which may not be desirable depending on the use case. This solution generalizes idxmax(1)
; in particular, if the max values are unique in each row, it matches the idxmax(1)
solution.
Upvotes: 7
Reputation: 11
Using numpy argmax is blazing fast. I've tested in a dataframe with 3,744,965 rows and it takes 103ms.
%timeit df.idxmax(axis=1)
7.67 s ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.columns[df.to_numpy().argmax(axis=1)]
103 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 177048
You can use idxmax
with axis=1
to find the column with the greatest value on each row:
>>> df.idxmax(axis=1)
0 Communications
1 Business
2 Communications
3 Communications
4 Business
dtype: object
To create the new column 'Max', use df['Max'] = df.idxmax(axis=1)
.
To find the row index at which the maximum value occurs in each column, use df.idxmax()
(or equivalently df.idxmax(axis=0)
).
Upvotes: 308
Reputation: 4292
And if you want to produce a column containing the name of the column with the maximum value but considering only a subset of columns then you use a variation of @ajcr's answer:
df['Max'] = df[['Communications','Business']].idxmax(axis=1)
Upvotes: 57
Reputation: 77027
You could apply
on dataframe and get argmax()
of each row via axis=1
In [144]: df.apply(lambda x: x.argmax(), axis=1)
Out[144]:
0 Communications
1 Business
2 Communications
3 Communications
4 Business
dtype: object
Here's a benchmark to compare how slow apply
method is to idxmax()
for len(df) ~ 20K
In [146]: %timeit df.apply(lambda x: x.argmax(), axis=1)
1 loops, best of 3: 479 ms per loop
In [147]: %timeit df.idxmax(axis=1)
10 loops, best of 3: 47.3 ms per loop
Upvotes: 14