Pandas DataFrame Custom GroupBy function received wrong data

Question

I'm having data from a time tracking tool. Within such data I'm trying to group the data by day and do some calculation on pauses, start and end time. Therefore I'm applying a custom group function. But somehow the data received in that function is always the one from group 1. It's always 2019-11-04 and never 2019-11-05. Any idea what am I doing wrong?

Input data:

    date_at minutes user_id start   end
0   2019-11-04  60  112826  2019-11-04 08:30:00 2019-11-04 09:30:00
1   2019-11-04  60  112826  2019-11-04 09:30:00 2019-11-04 10:30:00
2   2019-11-04  30  112826  2019-11-04 10:30:00 2019-11-04 11:00:00
3   2019-11-04  30  112826  2019-11-04 11:00:00 2019-11-04 11:30:00
4   2019-11-04  30  112826  2019-11-04 11:30:00 2019-11-04 12:00:00
5   2019-11-04  210 112826  2019-11-04 12:30:00 2019-11-04 16:00:00
6   2019-11-04  60  112826  2019-11-04 16:00:00 2019-11-04 17:00:00
7   2019-11-04  60  112826  2019-11-04 17:00:00 2019-11-04 18:00:00
8   2019-11-05  45  112826  2019-11-05 09:30:00 2019-11-05 10:15:00
9   2019-11-05  45  112826  2019-11-05 10:15:00 2019-11-05 11:00:00
10  2019-11-05  90  112826  2019-11-05 11:00:00 2019-11-05 12:30:00
11  2019-11-05  60  112826  2019-11-05 13:00:00 2019-11-05 14:00:00
12  2019-11-05  30  112826  2019-11-05 14:00:00 2019-11-05 14:30:00
13  2019-11-05  30  112826  2019-11-05 14:30:00 2019-11-05 15:00:00
14  2019-11-05  60  112826  2019-11-05 15:00:00 2019-11-05 16:00:00
15  2019-11-05  150 112826  2019-11-05 16:00:00 2019-11-05 18:30:00
16  2019-11-05  30  112826  2019-11-05 18:30:00 2019-11-05 19:00:00

My code:

def custom_groupby(data:pd.DataFrame):
    #print (data)
    data.sort_values(by=['start'], inplace=True, ascending=True)    
    times = list(zip(data.start, data.end))
    erroneous = False
    minutes_working = 0
    minutes_paused = 0

    l = len(times)
    day_begin = times[0][0]
    day_end = times[l-1][1]

    for i, (start1, end1) in enumerate(times):
        minutes_working += (end1 - start1).total_seconds() / 60   

        # is there a next entry?
        if i + 1 == l:
            break

        # get next entry
        start2, end2 = times[i+1]
        if start2 < end1:
            # start time of second entry is before end time of first entry
            erroneous = True
        else:
            minutes_paused += (start2 - end1).total_seconds() / 60

    if (day_end - day_begin).total_seconds() / 60 > 6 and minutes_paused == 0:
        # employee hasn't booked pauses but all as working time; 
        # or hasn't done any pause even though he/she is obliged to do so if working >6h
        erroneous = True
        # minimum pause according to Austrian labor law if working >6h
        minutes_paused = 30

    return pd.Series({
        'from':day_begin.strftime("%H:%M"),
        'to':day_end.strftime("%H:%M"),
        'hours_working':minutes_working / 60, 
        'hours_paused':minutes_paused / 60, 
        'error':erroneous})

df_sums = df.groupby(['date_at']).apply(custom_groupby)
df_sums

Results:

    from    to  hours_working   hours_paused    error
date_at                 
2019-11-04  08:30   18:00   9.0 0.5 False
2019-11-05  08:30   18:00   9.0 0.5 False

As you can see the grouped cell values for 2019-11-05 are the same as for 2019-11-04 and if you enable the print(data) in the custom function, you'll see that it's always group 1, but why?

mcsoini · Accepted Answer

This appears to be caused by the line

data.sort_values(by=['start'], inplace=True, ascending=True)

Since data is just a view (not a copy) of the original dataframe df, sorting with inplace=True inside the applied function is likely to cause similar trouble as modifying a list while looping over it. Changing the line to

data = data.sort_values(by=['start'], ascending=True)

creates a copy. Then the apply works as expected. Alternatively, you could experiment with sorting df prior to grouping, to avoid the additional memory use.

Pandas DataFrame Custom GroupBy function received wrong data

Answers (1)

Related Questions