Reputation: 1231
This is an example of the DataFrame I'm working with:
import pandas as pd

d = {
'item_number':['bdsm1000', 'bdsm1000', 'bdsm1000', 'ZZRWB18','ZZRWB18', 'ZZRWB18', 'ZZRWB18', 'ZZHP1427BLK', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1414', 'ZZHP1414', 'ZZHP1414', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE'],
'Comp_ID':[2454, 2454, 2454, 1395, 1395, 1395, 1395, 3378, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 1266941, 660867, 43978, 43978, 43978, 43978, 1197347907, 70745, 4737, 1197347907, 4737, 1197347907, 70745, 4737, 1197347907, 70745, 4737, 1197347907, 4737, 1197487704, 1197347907, 70745, 23872, 4737, 1197347907, 4737, 1197487704, 1197347907, 23872, 4737, 1197487704, 1197347907, 70745],
'date':['2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19']}
df = pd.DataFrame(data=d)
df.date = pd.to_datetime(df.date)
I'd like to count consecutive observations, starting from 2016-11-22, grouped by Comp_ID and item_number.
Essentially, what I'm looking to do is count how many days in a row there is an observation, counting back from today's date, for each Comp_ID and item_number (this example was put together on the 22nd of November). Runs of consecutive observations that ended days or weeks before today are not relevant; only sequences like today, yesterday, the day before yesterday, and so on count.
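To make the rule concrete, here is a minimal, self-contained sketch of the counting I have in mind for a single group's dates (streak_from is a hypothetical helper for illustration, not part of my actual code):
import datetime
def streak_from(dates, today):
    # length of the unbroken run of daily observations ending at `today`
    observed = set(dates)
    count = 0
    while today - datetime.timedelta(days=count) in observed:
        count += 1
    return count
# 2016-11-20 is missing, so only the 22nd and the 21st count:
streak_from([datetime.date(2016, 11, 22),
             datetime.date(2016, 11, 21),
             datetime.date(2016, 11, 19)],
            today=datetime.date(2016, 11, 22))  # -> 2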
I got this to work on a smaller sample, but it seems to get tripped up on a larger dataset. Here is the code for the smaller sample; I need to find the consecutive dates with observations across thousands of sellers and items, and for some reason the code below did not work on the larger dataset.
d = {'item_number':['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
'Comp_ID':['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
'date':['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}
df = pd.DataFrame(data=d)
df.date = pd.to_datetime(df.date)
one_day = pd.Timedelta(1, 'D')
df = df.sort_values(['item_number', 'date', 'Comp_ID'], ascending=False)
g = df.groupby(['Comp_ID', 'item_number'])
# flag rows whose date is within one day of the previous row in the group
sequence = g['date'].apply(lambda x: x.diff().fillna(pd.Timedelta(0)).abs().le(one_day)).reset_index()
sequence.set_index('index', inplace=True)
test = df.join(sequence)
test.columns = ['Comp_ID', 'date', 'item_number', 'consecutive']
g = test.groupby(['Comp_ID', 'item_number'])
g['consecutive'].apply(lambda x: x.idxmin() - x.idxmax())
This gets the desired result for the smaller data-set:
Comp_ID item_number
1395 G123H 2
KIN005 3
7787 A789B 2
Name: consecutive, dtype: int64
Upvotes: 6
Views: 4379
Reputation: 210832
You can do it this way:
today = pd.to_datetime('2016-11-22')
# sort DF by `date` (descending)
x = df.sort_values('date', ascending=False)
g = x.groupby(['Comp_ID', 'item_number'])
# keep rows whose distance in days from `today` equals their consecutive
# position within the group, then count the survivors per group
x[(today - x['date']).dt.days == g.cumcount()].groupby(['Comp_ID', 'item_number']).size()
Result:
Comp_ID item_number
1395 G123H 2
KIN005 3
7787 A789B 2
dtype: int64
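If the reference date should track the actual run date rather than the example's fixed timestamp, it can be derived dynamically; a one-line sketch, assuming the date column is normalized to midnight as in the sample data:
today = pd.Timestamp('today').normalize()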
PS thanks to @DataSwede for the faster diff calculation!
Explanation: within each group (already sorted by date descending), cumcount() numbers the rows 0, 1, 2, …, and a row survives the filter only when its date is exactly that many days before today; assuming at most one observation per group per day, matching therefore stops at the first gap.
In [124]: x[(today - x['date']).dt.days == g.cumcount()] \
.sort_values(['Comp_ID','item_number','date'], ascending=[1,1,0])
Out[124]:
Comp_ID date item_number
8 1395 2016-11-22 G123H
9 1395 2016-11-21 G123H
0 1395 2016-11-22 KIN005
1 1395 2016-11-21 KIN005
2 1395 2016-11-20 KIN005
5 7787 2016-11-22 A789B
6 7787 2016-11-21 A789B
Upvotes: 5
Reputation: 1901
First, I will propose that we yield a series of dates, each one day earlier than the prior...
import datetime
import pandas as pd
def gen_prior_date(start_date):
yield start_date
while True:
start_date -= datetime.timedelta(days=1)
yield start_date
>>> start_date = datetime.date(2016, 11, 22)
>>> back_in_time = gen_prior_date(start_date)
>>> next(back_in_time)
datetime.date(2016, 11, 22)
>>> next(back_in_time)
datetime.date(2016, 11, 21)
Now we need a function we can apply to each group...
def count_consec_dates(dates, start_date):
dates = pd.to_datetime(dates.values).date
dates_set = set(dates) # O(1) vs O(n) lookup times
back_in_time = gen_prior_date(start_date)
tally = 0
while next(back_in_time) in dates_set: # jump out on first miss
tally += 1
return tally
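As a quick sanity check on one series (illustrative only, using a couple of the sample dates):
>>> s = pd.Series(['2016-11-22', '2016-11-21', '2016-11-08'])
>>> count_consec_dates(s, datetime.date(2016, 11, 22))
2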
The rest is easy...
>>> small_data = {'item_number': ['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
... 'Comp_ID': ['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
... 'date': ['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}
>>> small_df = pd.DataFrame(data=small_data)
>>> start_date = datetime.date(2016, 11, 22)
>>> groups = small_df.groupby(['Comp_ID', 'item_number']).date
>>> groups.apply(lambda x: count_consec_dates(x, start_date))
Comp_ID item_number
1395 G123H 2
KIN005 3
7787 A789B 2
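The same call should drop straight onto the larger frame from the top of the question, since the per-group logic never assumes unique or sorted dates (a sketch, assuming that df and its parsed date column are in scope):
>>> df.groupby(['Comp_ID', 'item_number']).date.apply(
...     lambda x: count_consec_dates(x, datetime.date(2016, 11, 22)))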
Upvotes: 3