vinyluis

Reputation: 23

Sum of values for the same ID in a time window - Python

I have the following dataframe, which is the result of a groupby operation:

df = pd.DataFrame(np.array([
    ['ID1','2019-09-06',1],
    ['ID1','2019-09-11',1],
    ['ID1','2019-09-25',2],
    ['ID1','2019-09-27',1],
    ['ID1','2019-10-21',1],
    ['ID2','2019-10-15',1],
    ['ID2','2019-10-17',3],
    ['ID2','2019-10-19',2],
    ['ID2','2019-11-09',1],
]), columns = ["id", "date", "value"])

And I want to have the sum of the "value" column for the same ID in a time window ending on "date". The expected output for a 7d window would be:

expected = pd.DataFrame(np.array([
    ['ID1','2019-09-06',1,1],
    ['ID1','2019-09-11',1,2],
    ['ID1','2019-09-25',2,2],
    ['ID1','2019-09-27',1,3],
    ['ID1','2019-10-21',1,1],
    ['ID2','2019-10-15',1,1],
    ['ID2','2019-10-17',3,4],
    ['ID2','2019-10-19',2,6],
    ['ID2','2019-11-09',1,1],
]), columns = ["id", "date", "value", "sum of values in 7d"])

I already have code that works for this case, but it does not feel like the cleanest solution. It is also really slow when you have tens of thousands of rows (which is my case).

The function I have is:

def countPeriod(i, d, df, p = 7):
    # Slices a subset with only the corresponding ID
    aux = df[df["id"] == i]
    # Gets the p calendar days ending on d
    period = pd.date_range(end = d, periods = p)
    # Sums "value" for each date in the period
    s = 0
    for day in period:
        s += aux.loc[aux["date"] == day, "value"].sum()
    return s

It returns the expected value when I call it with:

result = df.copy()
result["date"] = pd.to_datetime(result["date"], format = "%Y-%m-%d").dt.date
result["value"] = result["value"].astype(int)
result["sum of values in 7d"] = result.apply(lambda x: countPeriod(x["id"], x["date"], result, 7), axis = 1)

I cannot think of other ways to do this. I even took a look at this topic, but it doesn't seem well suited to my problem.

Is there a cleaner and faster way to do this? I have to run operations like this many times on the data I'm working with.

Upvotes: 0

Views: 799

Answers (2)

vinyluis

Reputation: 23

The great example posted by r-beginners made me think of another solution to this problem.

It is a single function that runs on the dataset instead of using .apply(), and it uses .rolling() to get the results more easily.

def countPeriod3(df, p = 7):
    # Gets the IDs
    ids = df['id'].unique()
    # Creates an empty dataframe to hold all slices
    result = pd.DataFrame(columns = ['id', 'date', 'value', 'sum7d'])

    for i in ids:
        # Slices a subset with only the corresponding ID
        df_slice = df[df["id"] == i] 
        # Sets date as index to use asfreq
        aux = df_slice.set_index('date')
        # Uses a daily frequency to fill the gaps and fills NA accordingly
        aux = aux.asfreq('d')
        aux['id'] = aux["id"].fillna(i)
        aux['value'] = aux['value'].fillna(0)
        # Rolling window to sum 7 days
        aux['sum7d'] = aux['value'].rolling(p, min_periods=1).sum()
        # Puts date back as a column (reset_index returns a copy, so reassign)
        aux = aux.reset_index()
        # Deletes redundant columns
        aux = aux.drop(columns = ['id', 'value'])
        # Gets only the lines that appear on the slice
        df_slice = df_slice.merge(aux, on='date', how='left')
        # Puts all slices together
        result = pd.concat([result,df_slice], ignore_index = True)

    return result

In fact, I compared the run times of all three solutions with timeit and got the following:

  • Original code from the question = 43.2 ms
  • r-beginners' code = 26.6 ms
  • The code above = 11.7 ms

Thanks once again, r-beginners for the help!
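For what it's worth, pandas can also express the whole operation with a time-based rolling window, which skips the asfreq step entirely. A minimal sketch, assuming the frame is already sorted by id and then date, and that "date" is a datetime column:

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["ID1"] * 5 + ["ID2"] * 4,
    "date": pd.to_datetime(["2019-09-06", "2019-09-11", "2019-09-25",
                            "2019-09-27", "2019-10-21", "2019-10-15",
                            "2019-10-17", "2019-10-19", "2019-11-09"]),
    "value": [1, 1, 2, 1, 1, 1, 3, 2, 1],
})

# '7D' is a time-based window: for each row it sums the values whose date
# falls in the 7 calendar days ending on (and including) that row's date,
# separately per id. Positional assignment via to_numpy() is safe because
# df is sorted by id, so the grouped result comes back in the same order.
df["sum7d"] = (
    df.groupby("id")
      .rolling("7D", on="date")["value"]
      .sum()
      .to_numpy()
)
print(df["sum7d"].tolist())  # [1.0, 2.0, 2.0, 3.0, 1.0, 1.0, 4.0, 6.0, 1.0]
```

This matches the expected output from the question without any explicit loop over ids.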

Upvotes: 0

r-beginners

Reputation: 35125

I have considered only the basic features of pandas:

1. Use Grouper to create a DF that includes a record for every date, not just the dates present
2. Combine the original DF with the created DF
3. Add a column for the 7-day sum
4. Calculate the 7-day condition in the loop
5. Join the result to an empty DF

I'm not sure about the processing speed.

df['date'] = pd.to_datetime(df['date'])
df['value'] = df['value'].astype(int)

id_list = df['id'].unique()

new_df = pd.DataFrame(index=[], columns=['date','cnt','id_x','sum_7days'])
for k in id_list:
    # One group key per calendar day between the first and last date for this id
    df_dict = df[df['id'] == k].groupby(pd.Grouper(key='date', freq='1d')).groups
    all_day = pd.DataFrame({'date': list(df_dict.keys()), 'cnt': 1}, index=np.arange(len(df_dict)))
    all_day['id'] = k
    all_day = all_day.merge(df[df['id'] == k], on='date', how='outer').fillna(0)
    all_day.drop('id_y', inplace=True, axis=1)

    day_7 = tmp = 0
    all_day['sum_7days'] = 0
    for i in range(len(all_day)):
        if day_7 <= 7:
            day_7 += all_day['cnt'].iloc[i]
            tmp += all_day['value'].iloc[i]
            all_day.loc[all_day.index[i], 'sum_7days'] = tmp
        elif day_7 > 7:
            tmp = day_7 = 0
    new_df = pd.concat([new_df, all_day], ignore_index=True)

new_df = new_df[~(new_df['value'] == 0.0)]
new_df = new_df.loc[:, ['id_x', 'date', 'value', 'sum_7days']].reset_index(drop=True)

  id_x       date  value  sum_7days
0  ID1 2019-09-06    1.0        1.0
1  ID1 2019-09-11    1.0        2.0
2  ID1 2019-09-25    2.0        2.0
3  ID1 2019-09-27    1.0        3.0
4  ID1 2019-10-21    1.0        1.0
5  ID2 2019-10-15    1.0        1.0
6  ID2 2019-10-17    3.0        4.0
7  ID2 2019-10-19    2.0        6.0
8  ID2 2019-11-09    1.0        1.0

Upvotes: 1
