hlin117

Reputation: 22308

Binarize data frame values based upon a column value

I have a dataframe that looks like this:

+---------+-------------+------------+------------+
| hello   | val1        | val2       | val3       |
+---------+-------------+------------+------------+
| 1.024   | -10.764779  | -8.230176  | -5.689302  |
| 16      | -15.772744  | -10.794013 | -5.79148   |
| 1.024   | -18.4738    | -13.935423 | -9.392713  |
| 0.064   | -11.642506  | -9.711523  | -7.772969  |
| 1.024   | -4.185368   | -2.094441  | 0.048861   |
+---------+-------------+------------+------------+
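For anyone who wants to reproduce this, the table can be built roughly as follows. This is only a sketch: it assumes pandas is imported as pd, and the values are copied by hand from the table above.

```python
import pandas as pd

# Values copied from the table above; column names match its header.
df = pd.DataFrame({
    "hello": [1.024, 16, 1.024, 0.064, 1.024],
    "val1": [-10.764779, -15.772744, -18.4738, -11.642506, -4.185368],
    "val2": [-8.230176, -10.794013, -13.935423, -9.711523, -2.094441],
    "val3": [-5.689302, -5.79148, -9.392713, -7.772969, 0.048861],
})
print(df)
```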

Call this dataframe df. This is essentially the operation I would like to perform:

values = ["val1", "val2", "val3"]
for ind in df.index:
    hello = df.loc[ind, "hello"]
    for name in values:
        df.loc[ind, name] = (df.loc[ind, name] >= hello)

In other words, for every row i and column j: if val_j is less than hello_i, then val_j becomes False; otherwise val_j becomes True.

This is obviously not vectorized, and on the much larger version of this table that I actually work with, my computer struggles to finish the loop.

What's the vectorized version of the operation above?

Upvotes: 1

Views: 817

Answers (2)

EdChum

Reputation: 394469

It would be quicker to compare each val column, as a whole series, against the hello series:

In [268]:

val_cols = [col for col in df if 'val' in col]
for col in val_cols:
    df[col] = df[col] >= df['hello']
df
Out[268]:
    hello   val1   val2   val3
0   1.024  False  False  False
1  16.000  False  False  False
2   1.024  False  False  False
3   0.064  False  False  False
4   1.024  False  False  False

If we compare the performance:

In [273]:

%%timeit
val_cols = [col for col in df if 'val' in col]
for col in val_cols:
    df[col] = df[col] >= df['hello']
df
1000 loops, best of 3: 630 µs per loop

In [275]:

%%timeit
column_names = [name for name in df.columns if "val" in name]
binarized = df.apply(lambda row : row[column_names] >= row["hello"], axis=1)
df[binarized.columns] = binarized
df
100 loops, best of 3: 6.17 ms per loop

My method is roughly 10x faster (630 µs vs. 6.17 ms) because each column comparison is vectorised, whereas your method essentially loops over every row.
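The per-column loop can probably be collapsed into a single call as well. The following is a sketch rather than a benchmarked result: it uses DataFrame.ge with axis=0 so that the hello series is aligned with the row index instead of the column labels. The frame is rebuilt from the question's table so the snippet runs on its own, and the name binarized is just an illustrative choice.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "hello": [1.024, 16, 1.024, 0.064, 1.024],
    "val1": [-10.764779, -15.772744, -18.4738, -11.642506, -4.185368],
    "val2": [-8.230176, -10.794013, -13.935423, -9.711523, -2.094441],
    "val3": [-5.689302, -5.79148, -9.392713, -7.772969, 0.048861],
})

val_cols = [col for col in df.columns if "val" in col]

# Compare all val columns against "hello" in one call.
# axis=0 matches the Series to the rows, so each row uses its own hello value.
binarized = df[val_cols].ge(df["hello"], axis=0)

# Write the boolean result back over the original columns.
df[val_cols] = binarized
print(df)
```

For this sample data every entry comes out False, matching the output shown above, since no val is at least as large as the hello in its row.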

Upvotes: 1

hlin117

Reputation: 22308

Some experimentation led me to this:

column_names = [name for name in df.columns if "val" in name]
binarized = df.apply(lambda row : row[column_names] >= row["hello"], axis=1)
df[binarized.columns] = binarized

This works.
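Because apply with axis=1 still calls the lambda once per row, a broadcast comparison on the underlying arrays may scale better on a large table. Below is a minimal sketch of that idea, assuming NumPy is available. The frame is rebuilt from the question's table, and the names hello, mask and binarized are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "hello": [1.024, 16, 1.024, 0.064, 1.024],
    "val1": [-10.764779, -15.772744, -18.4738, -11.642506, -4.185368],
    "val2": [-8.230176, -10.794013, -13.935423, -9.711523, -2.094441],
    "val3": [-5.689302, -5.79148, -9.392713, -7.772969, 0.048861],
})

column_names = [name for name in df.columns if "val" in name]

# Reshape hello to (n_rows, 1) so it broadcasts across the val columns.
hello = df["hello"].to_numpy()[:, np.newaxis]

# One elementwise comparison over the whole (n_rows, n_val_cols) block.
mask = df[column_names].to_numpy() >= hello

# Wrap the boolean array back into a frame with the original labels.
binarized = pd.DataFrame(mask, index=df.index, columns=column_names)
df[binarized.columns] = binarized
```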

Upvotes: 0
