mlx

Reputation: 514

Setting highest value in row to 1 and rest to 0 in pandas

My original dataframe looks like this:

A       B       C
0.10    0.83    0.07
0.40    0.30    0.30
0.70    0.17    0.13    
0.72    0.04    0.24    
0.15    0.07    0.78    

I would like to binarize each row: the column holding the row's highest value gets 1 and every other column gets 0, so the dataframe above would become:

A   B   C
0   1   0
1   0   0
1   0   0   
1   0   0   
0   0   1   

How can this be done?
Thanks.

EDIT: I understand that one specific case made my question ambiguous. I should have said that if all 3 columns are equal in a given row, I still want a [1 0 0] vector for that row, not [1 1 1].

Upvotes: 8

Views: 4435

Answers (3)

user3483203

Reputation: 51165

Using numpy with argmax

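import numpy as np
import pandas as pd

# Start from an all-zero array, then set the position of each row's maximum to 1
m = np.zeros_like(df.values)
m[np.arange(len(df)), df.values.argmax(1)] = 1

df1 = pd.DataFrame(m, columns=df.columns).astype(int)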

# Result


   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1
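
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same argmax idea. The helper name binarize_rows and the all-equal test row are my own additions for illustration, not part of the original answer. Since argmax returns the first position of the maximum, a row of equal values should come out as [1 0 0], which is what the question's EDIT asks for.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [0.10, 0.40, 0.70, 0.72, 0.15],
    "B": [0.83, 0.30, 0.17, 0.04, 0.07],
    "C": [0.07, 0.30, 0.13, 0.24, 0.78],
})


def binarize_rows(frame):
    """Hypothetical helper: 1 at the first maximum of each row, 0 elsewhere."""
    values = frame.to_numpy()
    out = np.zeros(values.shape, dtype=int)
    # argmax(axis=1) gives the column position of the first maximum in each row
    out[np.arange(len(frame)), values.argmax(axis=1)] = 1
    return pd.DataFrame(out, columns=frame.columns, index=frame.index)


result = binarize_rows(df)

# A row where all three columns tie (illustrative data, not from the question)
tied = binarize_rows(pd.DataFrame({"A": [0.5], "B": [0.5], "C": [0.5]}))

print(result)
print(tied)
```

Unlike the snippet above, this version also keeps the original index of the dataframe, which is a design choice on my part rather than something the question requires.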

Timings

df_test = pd.concat([df] * 1000)

def chris_z(df):
    m = np.zeros_like(df.values)
    m[np.arange(len(df)), df.values.argmax(1)] = 1
    return pd.DataFrame(m, columns=df.columns).astype(int)

def haleemur(df):
    return df.apply(lambda x: x == x.max(), axis=1).astype(int)

def haleemur_2(df):
    return pd.DataFrame((df.T == df.T.max()).T.astype(int), columns=df.columns)

def sacul(df):
    return pd.DataFrame(np.where(df.T == df.T.max(), 1, 0),index=df.columns).T

Results

In [320]: %timeit chris_z(df_test)
358 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [321]: %timeit haleemur(df_test)
1.14 s ± 45.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [329]: %timeit haleemur_2(df_test)
972 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [333]: %timeit sacul(df_test)
1.01 ms ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
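
The numbers above come from IPython's %timeit. If you are not in IPython, a rough way to repeat the comparison with the standard timeit module might look like the sketch below. The function names argmax_version and transpose_version, the run count, and the use of ignore_index=True are my own assumptions, and absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Sample data copied from the question, repeated 1000 times as in the timings
df = pd.DataFrame({
    "A": [0.10, 0.40, 0.70, 0.72, 0.15],
    "B": [0.83, 0.30, 0.17, 0.04, 0.07],
    "C": [0.07, 0.30, 0.13, 0.24, 0.78],
})
df_test = pd.concat([df] * 1000, ignore_index=True)


def argmax_version(frame):
    # Write 1 at the first maximum of each row of an all-zero array
    values = frame.to_numpy()
    out = np.zeros(values.shape, dtype=int)
    out[np.arange(len(frame)), values.argmax(axis=1)] = 1
    return pd.DataFrame(out, columns=frame.columns, index=frame.index)


def transpose_version(frame):
    # Compare each row against its own maximum via a transpose
    return (frame.T == frame.T.max()).T.astype(int)


runs = 20  # arbitrary; raise it for steadier numbers
for fn in (argmax_version, transpose_version):
    per_call = timeit.timeit(lambda: fn(df_test), number=runs) / runs
    print(f"{fn.__name__}: {per_call * 1e6:.0f} µs per call")
```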

Upvotes: 10

Haleemur Ali

Reputation: 28283

df.apply(lambda x: x == x.max(), axis=1).astype(int)

should do it. With axis=1 the lambda receives one row at a time, so this checks whether each value equals the maximum of its row, and then casts the booleans to integers (True -> 1, False -> 0).

Instead of applying a lambda row-wise, you can also transpose the dataframe, compare it to the column-wise max of the transpose (which is the row-wise max of the original), and transpose back:

(df.T == df.T.max()).T.astype(int)

And lastly, a very fast numpy-based solution (requires import numpy as np):

pd.DataFrame((df.T.values == np.amax(df.values, 1)).T*1, columns = df.columns)

The output is in all cases:

   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1
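
One caveat regarding the question's EDIT: comparing against the row maximum marks every tied value, so a row of equal values becomes [1 1 1] rather than [1 0 0]. The sketch below illustrates this on made-up data and shows one possible way to keep only the first maximum using idxmax. The variable names and the tied dataframe are my own, chosen for illustration only.

```python
import pandas as pd

# Illustrative data: the first row is a three-way tie, the second is not
tied = pd.DataFrame({"A": [0.5, 0.2], "B": [0.5, 0.7], "C": [0.5, 0.1]})

# Comparing to the row maximum flags every tied column
by_max = tied.eq(tied.max(axis=1), axis=0).astype(int)

# idxmax returns the label of the first maximum in each row,
# so comparing column labels against it keeps a single 1 per row
winners = tied.idxmax(axis=1)
first_only = pd.DataFrame(
    (tied.columns.to_numpy() == winners.to_numpy()[:, None]).astype(int),
    index=tied.index,
    columns=tied.columns,
)

print(by_max)
print(first_only)
```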

Upvotes: 4

sacuL

Reputation: 51365

Another numpy method, using np.where:

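import numpy as np
import pandas as pd

# Mark positions equal to the row maximum with 1, everything else with 0
new_df = pd.DataFrame(np.where(df.T == df.T.max(), 1, 0), index=df.columns).T
print(new_df)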
   A  B  C
0  0  1  0
1  1  0  0
2  1  0  0
3  1  0  0
4  0  0  1

Upvotes: 3
