Reputation: 1581
Given a pd.DataFrame with 0.0 < values < 1.0, I would like to convert it to binary values 0/1 according to a defined threshold eps = 0.5:
0 1 2
0 0.35 0.20 0.81
1 0.41 0.75 0.59
2 0.62 0.40 0.94
3 0.17 0.51 0.29
Right now, I only have this for loop, which takes quite a long time on a large dataset:
import numpy as np
import pandas as pd
data = np.array([[.35, .2, .81], [.41, .75, .59],
                 [.62, .4, .94], [.17, .51, .29]])
df = pd.DataFrame(data, index=range(data.shape[0]), columns=range(data.shape[1]))
eps = .5
b = np.zeros((df.shape[0], df.shape[1]))
for i in range(df.shape[0]):
    for j in range(df.shape[1]):
        if df.loc[i, j] < eps:
            b[i, j] = 0
        else:
            b[i, j] = 1
df_bin = pd.DataFrame(b, columns=df.columns, index=df.index)
Does anybody know a more efficient way to convert to binary values? The expected output is:
0 1 2
0 0.0 0.0 1.0
1 0.0 1.0 1.0
2 1.0 0.0 1.0
3 0.0 1.0 0.0
Thanks,
Upvotes: 6
Views: 14667
Reputation: 42916
Since we have quite a few answers, all using different methods, I was curious about the speed comparison. Thought I'd share:
# create big test dataframe
dfbig = pd.concat([df]*200000, ignore_index=True)
print(dfbig.shape)
(800000, 3)
# pandas round()
%%timeit
dfbig.round()
101 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy round()
%%timeit
np.round(dfbig)
104 ms ± 2.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pandas .ge & .astype
%%timeit
dfbig.ge(0.5).astype(int)
9.32 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# numpy.where
%%timeit
np.where(dfbig<0.5, 0, 1)
21.5 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
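Conclusion (fastest to slowest):
1. ge & astype (about 9 ms)
2. np.where (about 21 ms)
3. np.round and round (both about 100 ms, effectively tied)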
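For anyone who wants to rerun the comparison outside a notebook, here is a minimal sketch using the standard-library timeit module instead of the %%timeit magic. The frame size, the repeat counts, and the names candidates and timings are my own assumptions, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Same sample data as in the question, repeated to get a larger frame.
data = np.array([[.35, .2, .81], [.41, .75, .59],
                 [.62, .4, .94], [.17, .51, .29]])
df = pd.DataFrame(data)
dfbig = pd.concat([df] * 20000, ignore_index=True)  # 80,000 rows (assumed size)

# The four approaches compared above, wrapped as zero-argument callables.
candidates = {
    "round": lambda: dfbig.round(),
    "np.round": lambda: np.round(dfbig),
    "ge + astype": lambda: dfbig.ge(0.5).astype(int),
    "np.where": lambda: np.where(dfbig < 0.5, 0, 1),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 10 calls each, stored as seconds per call.
    timings[name] = min(timeit.repeat(func, number=10, repeat=3)) / 10

# Print from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {seconds * 1e3:8.2f} ms")
```

Taking the minimum over repeats is a common way to reduce noise from other processes; it is not the same statistic as the mean ± std. dev. that %%timeit reports.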
Upvotes: 4
Reputation: 75080
Or you can use np.where() and assign the values to the underlying array:
df[:]=np.where(df<0.5,0,1)
0 1 2
0 0 0 1
1 0 1 1
2 1 0 1
3 0 1 0
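If you would rather keep the original frame untouched, a variant is to wrap the np.where result in a new DataFrame. This is only a sketch: the names eps and df_bin are borrowed from the question, and the sample data is the question's.

```python
import numpy as np
import pandas as pd

data = np.array([[.35, .2, .81], [.41, .75, .59],
                 [.62, .4, .94], [.17, .51, .29]])
df = pd.DataFrame(data)
eps = 0.5

# np.where returns a plain NumPy array, so pass index and columns
# explicitly to keep the labels of the original frame.
df_bin = pd.DataFrame(np.where(df < eps, 0, 1),
                      index=df.index, columns=df.columns)
print(df_bin)
```

Here df still holds the original floats afterwards, while df_bin holds the integer 0/1 values.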
Upvotes: 8
Reputation: 59274
1. df.round
>>> df.round()
2. np.round
>>> np.round(df)
3. astype
>>> df.ge(0.5).astype(int)
All of which yield (the third returns integers 0/1 instead of floats)
0 1 2
0 0.0 0.0 1.0
1 0.0 1.0 1.0
2 1.0 0.0 1.0
3 0.0 1.0 0.0
Note: round works here because it implicitly puts the threshold at .5, halfway between the two integers. For custom thresholds, use the third solution.
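To illustrate the note, here is a small sketch with a custom threshold. The values in df and the threshold eps = 0.7 are made up for the example. It also shows one edge case worth knowing: NumPy (and therefore pandas) rounds exact halves to the nearest even number, so a value of exactly 0.5 becomes 0 with round but 1 with ge(0.5).

```python
import pandas as pd

# Made-up values, chosen to include an exact 0.5 and an exact 0.7.
df = pd.DataFrame({"a": [0.29, 0.50, 0.71], "b": [0.10, 0.69, 0.70]})

eps = 0.7  # hypothetical custom threshold
custom = df.ge(eps).astype(int)  # 1 where value >= eps, else 0

# Boundary behaviour at exactly 0.5:
rounded = df.round()                  # round half to even: 0.5 -> 0.0
thresholded = df.ge(0.5).astype(int)  # 0.5 >= 0.5 -> 1

print(custom)
print(rounded.loc[1, "a"], thresholded.loc[1, "a"])
```

The ge version matches the loop in the question, where a value equal to eps is mapped to 1.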
Upvotes: 9