Reputation: 107
I have a bunch of dataframes and the same number of arrays, which represent intervals (break numbers) in the price column of these dataframes.
I need to assign a new column called description_contrib based on these intervals. E.g., if the price is 16 USD and the interval array looks like [0, 10], then the description_contrib value for this row will be 2, because 16 is greater than 0 and also greater than 10.
I came up with this code:
def description_contribution(df_cat):
    # df_cat is a list of dataframes; intervals is the matching list of break arrays
    for i in range(0, len(df_cat)):
        for j in range(0, len(intervals[i])):
            df_cat[i]['description_contrib'].loc[df_cat[i]['price'] >= intervals[i][j]] = j
But it runs slowly, and there is probably a more robust solution for this.
How can I improve it?
UPD: The data looks like this:
train_id  item_condition_id  brand_name  price  shipping  description_contrib
5644      1                  Unknown     15.0   1         6
12506     1                  Unknown     8.0    1         3
26141     1                  Unknown     20.0   1         8
And the intervals for this dataframe are:
[0.0, 0.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 31.0]
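As a sanity check, a minimal sketch (assuming each break array is sorted in ascending order) that reproduces the description_contrib values above, i.e. the index of the last break each price reaches:

import numpy as np

intervals = np.array([0.0, 0.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 31.0])
prices = np.array([15.0, 8.0, 20.0])
# index of the last break each price reaches (price >= break)
print(np.searchsorted(intervals, prices, side='right') - 1)  # [6 3 8]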
Upvotes: 1
Views: 1265
Reputation: 402813
You can perform a broadcasted comparison with the NumPy arrays:
v = (df.price.values[:, None] > intervals).sum(1)
This can be assigned back to df:
df['description_contrib'] = v
The caveat with this is the memory usage, especially for larger data, but that's a fair tradeoff for the speed.
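For illustration, a self-contained sketch of this approach using the sample data from the question (variable names here are assumed):

import numpy as np
import pandas as pd

intervals = np.array([0.0, 0.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 20.0, 22.0, 31.0])
df = pd.DataFrame({'price': [15.0, 8.0, 20.0]})

# (n_rows, 1) compared against (n_breaks,) broadcasts to an (n_rows, n_breaks)
# boolean matrix; summing along axis 1 counts the breaks each price exceeds
v = (df['price'].values[:, None] > intervals).sum(1)
df['description_contrib'] = v

The intermediate boolean matrix has shape (n_rows, n_breaks), which is the memory cost mentioned above.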
Upvotes: 1
Reputation: 2048
Most of the time, the first option to speed things up is to replace loops with a vectorized operation. For example, you can make your code faster and more readable this way:
import pandas as pd

intervals = [0, 10]
df_cat = pd.DataFrame({'price': range(100)})
# count how many break values each price exceeds
df_cat['description_contrib'] = sum(df_cat['price'] > v for v in intervals)
Assuming that df_cat has many rows and there are few intervals, this will give you good performance. Still, faster ways may exist.
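If, as in the question, there is a list of dataframes with a matching list of break arrays, the same one-liner can be applied pairwise; df_cats and all_intervals below are assumed stand-ins for the question's lists:

for df, breaks in zip(df_cats, all_intervals):
    df['description_contrib'] = sum(df['price'] > v for v in breaks)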
Upvotes: 1