Reputation: 23
I have the following dataframe, which is the result of a groupby operation:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    ['ID1', '2019-09-06', 1],
    ['ID1', '2019-09-11', 1],
    ['ID1', '2019-09-25', 2],
    ['ID1', '2019-09-27', 1],
    ['ID1', '2019-10-21', 1],
    ['ID2', '2019-10-15', 1],
    ['ID2', '2019-10-17', 3],
    ['ID2', '2019-10-19', 2],
    ['ID2', '2019-11-09', 1],
]), columns=["id", "date", "value"])
I want the sum of the "value" column over rows with the same ID that fall in a time window ending on each row's "date". The expected output for a 7d window would be:
expected = pd.DataFrame(np.array([
    ['ID1', '2019-09-06', 1, 1],
    ['ID1', '2019-09-11', 1, 2],
    ['ID1', '2019-09-25', 2, 2],
    ['ID1', '2019-09-27', 1, 3],
    ['ID1', '2019-10-21', 1, 1],
    ['ID2', '2019-10-15', 1, 1],
    ['ID2', '2019-10-17', 3, 4],
    ['ID2', '2019-10-19', 2, 6],
    ['ID2', '2019-11-09', 1, 1],
]), columns=["id", "date", "value", "sum of values in 7d"])
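To be explicit about the window boundaries: `pd.date_range(end=d, periods=7)`, as used in the function below, makes the window inclusive of the row's own date, i.e. the date itself plus the six preceding days. A quick check:

```python
import pandas as pd

# A 7-day window ending on 2019-09-27 spans the end date itself
# plus the six preceding days.
window = pd.date_range(end="2019-09-27", periods=7)
print(window[0].date(), "->", window[-1].date())  # 2019-09-21 -> 2019-09-27
```

That is why the row for ID1 on 2019-09-27 sums to 3: the window picks up both 2019-09-25 and 2019-09-27.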
I already have code that works for this case, but it doesn't feel like the cleanest solution, and it is really slow when you have tens of thousands of rows (which is my case).
The function I have is:
def countPeriod(i, d, df, p=7):
    # Slice a subset with only the corresponding ID
    aux = df[df["id"] == i]
    # Get the p-day period ending on d (inclusive)
    period = pd.date_range(end=d, periods=p)
    # Sum "value" for each date in the period; an empty match
    # simply contributes 0, so no try/except is needed
    s = 0
    for day in period:  # don't shadow the parameter d
        s += aux.loc[aux["date"] == day, "value"].sum()
    return s
It returns the expected value when I call it with:
result = df.copy()
# format="%Y-%m-%d" already fixes the field order, so dayfirst is redundant
result["date"] = pd.to_datetime(result["date"], format="%Y-%m-%d").dt.date
result["value"] = result["value"].astype(int)
result["sum of values in 7d"] = result.apply(
    lambda x: countPeriod(x["id"], x["date"], result, 7), axis=1
)
I cannot think of other ways to do this. I even took a look at this topic, but it doesn't seem well fitted to my problem either.
Is there any cleaner and faster way to do this? I have to run such operations many times on the data I'm working with.
Upvotes: 0
Views: 799
Reputation: 23
The great example posted by r-beginners made me think of another solution to this problem: a single function that runs over the dataset instead of using .apply(), with .rolling() as the way to get the results more easily.
def countPeriod3(df, p=7):
    # Get the unique IDs
    ids = df['id'].unique()
    # Empty dataframe to hold all slices
    result = pd.DataFrame(columns=['id', 'date', 'value', 'sum7d'])
    for i in ids:
        # Slice a subset with only the corresponding ID
        df_slice = df[df["id"] == i]
        # Set date as the index to use asfreq
        aux = df_slice.set_index('date')
        # Daily frequency fills the gaps; fill the NAs accordingly
        aux = aux.asfreq('d')
        aux['id'] = aux['id'].fillna(i)
        aux['value'] = aux['value'].fillna(0)
        # Rolling window to sum 7 days
        aux['sum7d'] = aux['value'].rolling(p, min_periods=1).sum()
        # Put date back as a column (reset_index returns a new frame,
        # so the result must be assigned back)
        aux = aux.reset_index()
        # Drop redundant columns
        aux = aux.drop(columns=['id', 'value'])
        # Keep only the dates that appear in the slice
        df_slice = df_slice.merge(aux, on='date', how='left')
        # Put all slices together
        result = pd.concat([result, df_slice], ignore_index=True)
    return result
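For what it's worth, pandas can also do the time-based windowing directly: .rolling() accepts an offset string like "7D" when the data has a datetime index, which removes the need for asfreq and the manual gap filling. A sketch on the same data (the column name sum7d is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    ['ID1', '2019-09-06', 1], ['ID1', '2019-09-11', 1],
    ['ID1', '2019-09-25', 2], ['ID1', '2019-09-27', 1],
    ['ID1', '2019-10-21', 1], ['ID2', '2019-10-15', 1],
    ['ID2', '2019-10-17', 3], ['ID2', '2019-10-19', 2],
    ['ID2', '2019-11-09', 1],
]), columns=["id", "date", "value"])
df["date"] = pd.to_datetime(df["date"])
df["value"] = df["value"].astype(int)

# Time-based rolling window: "7D" covers the 7 calendar days ending on
# each row's date, so calendar gaps need no explicit filling.
result = df.sort_values(["id", "date"]).copy()
result["sum7d"] = (
    result.set_index("date")
          .groupby("id")["value"]
          .rolling("7D")
          .sum()
          .to_numpy()  # groups come back in sorted (id, date) order
)
print(result["sum7d"].tolist())
# [1.0, 2.0, 2.0, 3.0, 1.0, 1.0, 4.0, 6.0, 1.0]
```

Note the .to_numpy() alignment relies on the frame being sorted by id and date first, matching the sorted group order that groupby produces.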
In fact, I compared the run times of all of the versions with timeit, and got the following:
Thanks once again, r-beginners for the help!
Upvotes: 0
Reputation: 35125
I have considered only the basic features of pandas:
1. Use Grouper to create a DF that has a record for every date, not just the dates present.
2. Combine the original DF with the created DF.
3. Add a column for the 7-day sum.
4. Compute the 7-day sums in the loop.
5. Join each result to an empty DF.
I'm not sure about the processing speed.
df['date'] = pd.to_datetime(df['date'])
df['value'] = df['value'].astype(int)

id_list = df['id'].unique()
new_df = pd.DataFrame(index=[], columns=['date', 'cnt', 'id_x', 'sum_7days'])

for k in id_list:
    df_dict = df[df['id'] == k].groupby(pd.Grouper(key='date', freq='1d')).groups
    all_day = pd.DataFrame({'date': list(df_dict.keys()), 'cnt': 1},
                           index=np.arange(len(df_dict)))
    all_day['id'] = k
    all_day = all_day.merge(df[df['id'] == k], on='date', how='outer').fillna(0)
    all_day.drop('id_y', inplace=True, axis=1)
    day_7 = tmp = 0
    all_day['sum_7days'] = 0
    for i in range(len(all_day)):
        if day_7 <= 7:
            day_7 += all_day['cnt'].iloc[i]
            tmp += all_day['value'].iloc[i]
            # write back via iloc on the frame to avoid chained assignment
            all_day.iloc[i, all_day.columns.get_loc('sum_7days')] = tmp
        elif day_7 > 7:
            tmp = day_7 = 0
    new_df = pd.concat([new_df, all_day], ignore_index=True)

new_df = new_df[~(new_df['value'] == 0.0)]
new_df = new_df.loc[:, ['id_x', 'date', 'value', 'sum_7days']].reset_index(drop=True)
id_x date value sum_7days
0 ID1 2019-09-06 1.0 1.0
1 ID1 2019-09-11 1.0 2.0
2 ID1 2019-09-25 2.0 2.0
3 ID1 2019-09-27 1.0 3.0
4 ID1 2019-10-21 1.0 1.0
5 ID2 2019-10-15 1.0 1.0
6 ID2 2019-10-17 3.0 4.0
7 ID2 2019-10-19 2.0 6.0
8 ID2 2019-11-09 1.0 1.0
Upvotes: 1