Reputation: 22308
I have a dataframe that looks like this
+---------+-------------+------------+------------+
| hello | val1 | val2 | val3 |
+---------+-------------+------------+------------+
| 1.024 | -10.764779 | -8.230176 | -5.689302 |
| 16 | -15.772744 | -10.794013 | -5.79148 |
| 1.024 | -18.4738 | -13.935423 | -9.392713 |
| 0.064 | -11.642506 | -9.711523 | -7.772969 |
| 1.024 | -4.185368 | -2.094441 | 0.048861 |
+---------+-------------+------------+------------+
Let this dataframe be df. This is the operation I would essentially like to do:
values = ["val1", "val2", "val3"]
for ind in df.index:
    hello = df.loc[ind, "hello"]
    for name in values:
        df.loc[ind, name] = (df.loc[ind, name] >= hello)
Essentially, for every row i and column j, if val_j is less than hello_i, then val_j = False, otherwise val_j = True.
This is obviously not vectorized, and my computer struggles to run it on the much larger real version of this table.
What's the vectorized version of the operation above?
Upvotes: 1
Views: 817
Reputation: 394469
It would be quicker to test the entire series against the hello series:
In [268]:
val_cols = [col for col in df if 'val' in col]
for col in val_cols:
    df[col] = df[col] >= df['hello']
df
Out[268]:
hello val1 val2 val3
0 1.024 False False False
1 16.000 False False False
2 1.024 False False False
3 0.064 False False False
4 1.024 False False False
If we compare the performance:
In [273]:
%%timeit
val_cols = [col for col in df if 'val' in col]
for col in val_cols:
    df[col] = df[col] >= df['hello']
df
1000 loops, best of 3: 630 µs per loop
In [275]:
%%timeit
column_names = [name for name in df.columns if "val" in name]
binarized = df.apply(lambda row : row[column_names] >= row["hello"], axis=1)
df[binarized.columns] = binarized
df
100 loops, best of 3: 6.17 ms per loop
We see that my method is roughly 10x faster (630 µs vs 6.17 ms) because each column comparison is vectorised, whereas your apply(..., axis=1) method is essentially looping over each row in Python.
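If you want to avoid the Python-level loop over columns as well, one option is to compare all the val columns against hello in a single call with DataFrame.ge and axis=0. The following is a minimal sketch, not benchmarked here; it rebuilds the sample frame from the question, and the name binarized is just borrowed from the other answer for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "hello": [1.024, 16, 1.024, 0.064, 1.024],
    "val1": [-10.764779, -15.772744, -18.4738, -11.642506, -4.185368],
    "val2": [-8.230176, -10.794013, -13.935423, -9.711523, -2.094441],
    "val3": [-5.689302, -5.79148, -9.392713, -7.772969, 0.048861],
})

val_cols = [col for col in df.columns if "val" in col]

# axis=0 aligns the "hello" Series with the row index, so every value in a
# row is compared against that row's own "hello" value.
binarized = df[val_cols].ge(df["hello"], axis=0)

# Write the boolean results back over the original val columns.
df[val_cols] = binarized
print(df)
```

The axis=0 argument matters: without it, pandas would try to align the index labels of hello with the column names, which is not the row-wise comparison asked for.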
Upvotes: 1
Reputation: 22308
Some experimentation led me to this:
column_names = [name for name in df.columns if "val" in name]
binarized = df.apply(lambda row : row[column_names] >= row["hello"], axis=1)
df[binarized.columns] = binarized
This works.
Upvotes: 0