Raihan Masud

Reputation: 21

How to set the row-wise max elements of a pandas DataFrame to 1 and everything else to 0?

Let's say I have a pandas DataFrame.

# assumes: import numpy as np; import pandas as pd
df = pd.DataFrame(np.random.randn(10, 6), index=range(10), columns=list('abcdef'))

df:

         a         b         c         d         e         f
0 -1.238393 -0.755117 -0.228638 -0.077966  0.412947  0.887955
1 -0.342087  0.296171  0.177956  0.701668 -0.481744 -1.564719
2  0.610141  0.963873 -0.943182 -0.341902  0.326416  0.818899
3 -0.561572  0.063588 -0.195256 -1.637753  0.622627  0.845801
4 -2.506322 -1.631023  0.506860  0.368958  1.833260  0.623055
5 -1.313919 -1.758250 -1.082072  1.266158  0.427079 -1.018416
6 -0.781842  1.270133 -0.510879 -1.438487 -1.101213 -0.922821
7 -0.456999  0.234084  1.602635  0.611378 -1.147994  1.204318
8  0.497074  0.412695 -0.458227  0.431758  0.514382 -0.479150
9 -1.289392 -0.218624  0.122060  2.000832 -1.694544  0.773330

How do I set the row-wise maximum to 1 and all other elements to 0?

I came up with:

>>> for i in range(len(df)):
...     df.loc[i, df.loc[i].idxmax()] = 1
...     df.loc[i, df.loc[i] != 1] = 0

which produces:

   a  b  c  d  e  f
0  0  0  0  0  0  1
1  0  0  0  1  0  0
2  0  1  0  0  0  0
3  0  0  0  0  0  1
4  0  0  0  0  1  0
5  0  0  0  1  0  0
6  0  1  0  0  0  0
7  0  0  1  0  0  0
8  0  0  0  0  1  0
9  0  0  0  1  0  0

Does anyone have a better way of doing this, perhaps by getting rid of the for loop or by using apply with a lambda?

Upvotes: 2

Views: 111

Answers (3)

EdChum

Reputation: 394099

Use max to get each row's maximum, compare every element against it with eq, and cast the resulting boolean DataFrame to int with astype, which converts True and False to 1 and 0:

In [21]:
df = pd.DataFrame(np.random.randn(10, 6), index=range(10), columns=list('abcdef'))
df

Out[21]:
          a         b         c         d         e         f
0  0.797000  0.762125 -0.330518  1.117972  0.817524  0.041670
1  0.517940  0.357369 -1.493552 -0.947396  3.082828  0.578126
2  1.784856  0.672902 -1.359771 -0.090880 -0.093100  1.099017
3 -0.493976 -0.390801 -0.521017  1.221517 -1.303020  1.196718
4  0.687499 -2.371322 -2.474101 -0.397071  0.132205  0.034631
5  0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501
6 -0.415331  1.185901  1.173457  0.317577 -0.408544 -1.055770
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516
8 -0.987306  0.738833 -1.207124  0.738084  1.118205 -0.899086
9  0.282800 -1.226499  1.658416 -0.381222  1.067296 -1.249829

In [22]:
df = df.eq(df.max(axis=1), axis=0).astype(int)
df

Out[22]:
   a  b  c  d  e  f
0  0  0  0  1  0  0
1  0  0  0  0  1  0
2  1  0  0  0  0  0
3  0  0  0  1  0  0
4  1  0  0  0  0  0
5  1  0  0  0  0  0
6  0  1  0  0  0  0
7  0  1  0  0  0  0
8  0  0  0  0  1  0
9  0  0  1  0  0  0
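
To make the intermediate steps visible, here is a minimal sketch on a small fixed frame. The data and the names row_max, mask, flags and first_only are illustrative and not part of the original answer. It also shows one thing the random example cannot: when a row has tied maxima, eq flags every tied element, whereas the asker's idxmax loop flags only the first. The last few lines are an assumed NumPy argmax variant for that first-only behaviour.

```python
import numpy as np
import pandas as pd

# Illustrative data (not from the post); the last row has a tie between a and b.
df = pd.DataFrame(
    {"a": [1.0, 5.0, 3.0], "b": [4.0, 2.0, 3.0], "c": [0.5, -1.0, 1.0]}
)

row_max = df.max(axis=1)        # Series holding the largest value of each row
mask = df.eq(row_max, axis=0)   # align row_max on the index, compare element-wise
flags = mask.astype(int)        # True/False -> 1/0; ties all become 1

# Variant: flag only the first maximum in each row, like idxmax does.
pos = df.to_numpy().argmax(axis=1)           # column position of first max per row
one_hot = np.zeros(df.shape, dtype=int)
one_hot[np.arange(len(df)), pos] = 1
first_only = pd.DataFrame(one_hot, index=df.index, columns=df.columns)

print(flags)
print(first_only)
```

With continuous random data, ties are practically impossible, so both variants give the same result on frames like the one in the question.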

Timings

In [24]:
# @Raihan Masud's method
%timeit df.apply(lambda x: np.where(x == x.max(), 1, 0), axis=1)
# mine
%timeit df.eq(df.max(axis=1), axis=0).astype(int)
100 loops, best of 3: 7.94 ms per loop
1000 loops, best of 3: 640 µs per loop

In [25]:
%%timeit
# @Nader Hisham's method
def max_binary(df):
    binary = np.where(df == df.max(), 1, 0)
    return binary

df.apply(max_binary, axis=1)
100 loops, best of 3: 9.63 ms per loop

At 640 µs versus 7.94 ms, my method is over 12 times faster than @Raihan Masud's, and about 15 times faster than @Nader Hisham's (9.63 ms).

In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0

10 loops, best of 3: 21.1 ms per loop

The for loop is slower still: at 21.1 ms it takes roughly 33 times as long as the vectorized version.
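
Outside IPython, a comparison like this can be rerun with the standard timeit module. The following is only a sketch: the function names vectorized and row_wise are hypothetical, and row_wise is a variant of the apply-based answers that returns a Series per row so the column labels survive. Absolute numbers depend on the machine and the pandas version, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the frame is reproducible
df = pd.DataFrame(rng.standard_normal((10, 6)), columns=list("abcdef"))


def vectorized(frame):
    # Compare the whole frame against the per-row maxima in one step.
    return frame.eq(frame.max(axis=1), axis=0).astype(int)


def row_wise(frame):
    # Call a Python function once per row; each call returns a Series.
    return frame.apply(lambda row: (row == row.max()).astype(int), axis=1)


# Both approaches should agree before their timings are compared.
same = np.array_equal(vectorized(df).to_numpy(), row_wise(df).to_numpy())
print("results match:", same)

runs = 200
for name, fn in [("vectorized", vectorized), ("row-wise apply", row_wise)]:
    seconds = timeit.timeit(lambda: fn(df), number=runs)
    print(f"{name}: {seconds / runs * 1e3:.3f} ms per call")
```

The gap generally widens as the number of rows grows, because apply with axis=1 pays Python-level overhead for every row.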

Upvotes: 1

Raihan Masud

Reputation: 21

Following Nader's pattern, this is a shorter version:

df.apply(lambda x: np.where(x == x.max(), 1, 0), axis=1)

Upvotes: 0

Nader Hisham

Reputation: 5414

import numpy as np


def max_binary(df):
    # With axis=1, apply passes one row at a time, so df here is a Series
    binary = np.where(df == df.max(), 1, 0)
    return binary


df.apply(max_binary, axis=1)

Upvotes: 0
