Pandas - Compute data from a column to another

Question

Considering the following dataframe:

df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df.tail(6)

    week extra_hours  extra_hours_over
12     3    01:00:00               NaN
13     3    02:00:00               NaN
14     3    01:00:00               NaN
15     3    02:00:00               NaN
16     3    00:00:00               NaN
17     3    00:00:00               NaN

Now, in every week, the maximum amount of extra_hours is 4h, meaning I have to subtract 30min blocks from extra_hour column, and fill the extra_hour_over column, so that in every week, total sum of extra_hour has a maximum of 4h.

So, given the example dataframe, a possible solution (for week 3) would be like this:

    week  extra_hours  extra_hours_over
12     3     01:00:00          00:00:00
13     3     01:30:00          00:30:00
14     3     00:30:00          00:30:00
15     3     01:00:00          01:00:00
16     3     00:00:00          00:00:00
17     3     00:00:00          00:00:00

I would need to aggregate total extra_hours per week, check in which days it passes 4h, and then randomly subtract half-hour chunks.

What would be the easiest/most direct way to achieve this?

rrcal · Accepted Answer

Here goes one attempt for what you seem to be asking. The idea is simple, although the code fairly verbose:

1) Create some helper variables (minutes, extra_minutes, total for the week)

2) Loop through a temporary dataset that will contain only while sum is > 240 minutes.

3) In the loop, use random.choice to select a time to remove 30 min from.

4) Apply the changes to minutes and extra minutes

The code:

df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")

df['minutes'] = pd.DatetimeIndex(df['extra_hours']).hour * 60 + pd.DatetimeIndex(df['extra_hours']).minute
df['extra_minutes'] = 0

df['tot_time'] =  df.groupby('week')['minutes'].transform('sum')

while not df[df['tot_time'] > 240].empty:
    mask = df[(df['minutes']>=30)&(df['tot_time']>240)].groupby('week').apply(lambda x: np.random.choice(x.index)).values
    df.loc[mask,'minutes'] -= 30
    df.loc[mask,'extra_minutes'] += 30

    df['tot_time'] =  df.groupby('week')['minutes'].transform('sum')

df['extra_hours_over'] = df['extra_minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df['extra_hours'] = df['minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df.drop(['minutes','extra_minutes'], axis=1).tail(6)

Out[1]:
    week    extra_hours     extra_hours_over    tot_time
12  3       00:30:00        00:30:00             240
13  3       01:30:00        00:30:00             240
14  3       00:30:00        00:30:00             240
15  3       01:30:00        00:30:00             240
16  3       00:00:00        00:00:00             240
17  3       00:00:00        00:00:00             240

Note: Because I am using np.random.choice, the same observation can be picked twice, which will make that observation change by a chunk of more than 30 min.

Pandas - Compute data from a column to another

Answers (1)

Related Questions