Reputation: 2676
I've got a pandas DataFrame comprising items from an e-commerce site. I'm checking whether or not items still exist in the data frame on either subsequent (look ahead) or previous (behind) days, with a function like this:
#get a unique set of product urls for the next 2 days. If the current row's URL is in that set, mark it as 0 (not expiring); otherwise mark it 1 (expiring).
def is_expiring_product(row):
if row.childurl in df[(df['date'] > row['date']) & (df['date'] < (row['date'] + np.timedelta64(2,"D")) ) ].childurl.unique():
row.is_expiring_product = 0
else:
row.is_expiring_product = 1
return row
The problem is that this takes ages (24 hours+, if it ever completes) to apply to a frame of 1M+ records.
Intuition tells me there must be a more performant method for this...
Maybe creating a separate DataFrame of unique URLs for each day, then seeking on that DataFrame instead of the larger frame? I'm not sure why that would be any faster though...
Upvotes: 0
Views: 214
Reputation: 31011
For the test I used the following DataFrame (date column is of type datetime64[ns]):
date url
0 2019-09-10 Xxx
1 2019-09-12 Xxx
2 2019-09-14 Xxx
3 2019-09-11 Yyy
4 2019-09-12 Yyy
5 2019-09-12 Zz1
6 2019-09-12 Zz2
7 2019-09-08 Ttt
8 2019-09-12 Ttt
9 2019-09-15 Ttt
The starting point is to define 2 variables, used in cutting of df, by date, into bins (Past, Today and Future):
dMin = pd.to_datetime(0) # "Minimal" date
dMax = pd.to_datetime('2050-12-31') # "Maximal" date
Assume that we want to compute is_expiring_product only for 2019-09-12, so the first thing is to set currDate (will also be used in cutting):
currDate = pd.to_datetime('2019-09-12')
Then add is_expiring_product column, initially filled with empty strings:
df = df.assign(is_expiring_product='')
Compute bucket - the result of cutting dates into bins:
bucket = pd.cut(df.date, bins = [dMin, currDate - pd.offsets.Day(1),
currDate, dMax], labels=['Past', 'Today', 'Future'])
The next step is to compute df_current - a subset of rows for the "current" date:
df_current = df[bucket == 'Today']
To generate indices of urls in rows for "other" dates, run:
ind_other = pd.Index(df[bucket.isin(['Past', 'Future'])].url.unique())
And the last step is to generate values for is_expiring_product and update this column:
df.is_expiring_product.update((~df_current.url.isin(ind_other)).astype(int))
The result is:
date url is_expiring_product
0 2019-09-10 Xxx
1 2019-09-12 Xxx 0
2 2019-09-14 Xxx
3 2019-09-11 Yyy
4 2019-09-12 Yyy 0
5 2019-09-12 Zz1 1
6 2019-09-12 Zz2 1
7 2019-09-08 Ttt
8 2019-09-12 Ttt 0
9 2019-09-15 Ttt
Description of results:
The source DataFrame is the same, start from defining dMin and dMax as before.
Then define the following function to set the expiration status:
def setExpired(grp):
currDate = grp.name # Grouping key
bucket = pd.cut(df.date, bins = [dMin, currDate - pd.offsets.Day(1),
currDate, dMax], labels=['Past', 'Current', 'Future'])
df_current = df[bucket == 'Current']
ind_other = pd.Index(df[bucket.isin(['Past', 'Future'])].url.unique())
res = (~df_current.url.isin(ind_other)).astype(int)
return res
And the whole computing boils down to a single instruction, applicating this function to each group (for each date):
df['is_expiring_product'] = df.groupby('date').apply(setExpired).droplevel(0)
The additional call to droplevel(0) is required, because the result of apply has a MultiIndex with:
So in order to save it in df (using "ordinary" index), the top level of the MultiIndex must be dropped.
This time the result is:
date url is_expiring_product
0 2019-09-10 Xxx 0
1 2019-09-12 Xxx 0
2 2019-09-14 Xxx 0
3 2019-09-11 Yyy 0
4 2019-09-12 Yyy 0
5 2019-09-12 Zz1 1
6 2019-09-12 Zz2 1
7 2019-09-08 Ttt 0
8 2019-09-12 Ttt 0
9 2019-09-15 Ttt 0
Note that there is no need to compute is_expiring_product for each row separately.
Much simpler and quicker, totally different solution is to:
This solution works as follows:
The whole above action can be saved as a single instruction:
df['is_expiring_product'] = df.groupby('url').date\
.transform(lambda grp: 1 if grp.unique().size == 1 else 0)
with the same result as before.
Please write a comment, how long did it take to execute the above instruction on your DataFrame and how many rows it has.
Upvotes: 1