shahid hamdam

Reputation: 821

Sum a column based on groupby and condition

I have a DataFrame with several columns. I want to sum the "gap" column over the rows whose time falls within given time slots.

   region        date      time  gap
0       1  2016-01-01  00:00:08    1
1       1  2016-01-01  00:00:48    0
2       1  2016-01-01  00:02:50    1
3       1  2016-01-01  00:00:52    0
4       1  2016-01-01  00:10:01    0
5       1  2016-01-01  00:10:03    1
6       1  2016-01-01  00:10:05    0
7       1  2016-01-01  00:10:08    0
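For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample above. The column names without trailing periods are an assumption about the real frame.

```python
import pandas as pd

# Sample rows copied from the table above.
# Column names (region, date, time, gap) are assumed to have no trailing periods.
df = pd.DataFrame({
    "region": [1] * 8,
    "date": ["2016-01-01"] * 8,
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})
print(df)
```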

I want to sum the gap column. My time slots are stored in a dict like this:

hoursdict = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}

After the summation, the DataFrame above should look like this:

   region        date            time  gap
0       1  2016-01-01  00:10:00/slot1    2
1       1  2016-01-01  00:20:00/slot2    1

I have many regions and 144 time slots from 00:00:00 to 23:59:49. I tried this:

regres=reg.groupby(['start_region_hash','Date','Time'])['Time'].apply(lambda x: (x >= hoursdict['slot1']) & (x <= hoursdict['slot2'])).sum()

But it doesn't work.

Upvotes: 1

Views: 294

Answers (3)

jezrael

Reputation: 862406

The idea is to convert the time column to datetimes, floor them to 10 minutes, and then convert them back to HH:MM:SS strings:

d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}

df['time'] = pd.to_datetime(df['time']).dt.floor('10Min').dt.strftime('%H:%M:%S')
print (df)
   region        date      time  gap
0       1  2016-01-01  00:00:00    1
1       1  2016-01-01  00:00:00    0
2       1  2016-01-01  00:00:00    1
3       1  2016-01-01  00:00:00    0
4       1  2016-01-01  00:10:00    0
5       1  2016-01-01  00:10:00    1
6       1  2016-01-01  00:10:00    0
7       1  2016-01-01  00:10:00    0

Then aggregate with sum, and finally map the values through the dictionary with its keys and values swapped:

regres = df.groupby(['region','date','time'], as_index=False)['gap'].sum()
regres['time'] = regres['time'] + '/' + regres['time'].map(d1)
print (regres)
   region        date            time  gap
0       1  2016-01-01  00:00:00/slot1    2
1       1  2016-01-01  00:10:00/slot2    1
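The dictionary d above lists only three slots. A possible sketch for building all 144 ten-minute slots programmatically, assuming slotN starts at (N-1)*10 minutes past midnight as in the question's dictionary:

```python
import pandas as pd

# 144 slot start times, 10 minutes apart, formatted as HH:MM:SS.
# The date is arbitrary; only the time of day is kept.
starts = pd.date_range("2016-01-01", periods=144, freq="10min").strftime("%H:%M:%S")

# slot name -> start time, in the same shape as the question's dictionary
d = {f"slot{i}": t for i, t in enumerate(starts, start=1)}

# start time -> slot name, as used with .map() above
d1 = {v: k for k, v in d.items()}
```

With this d1, the same regres['time'].map(d1) call should cover every floored time of the day.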

If you want to display the end of each 10-minute slot (the next slot boundary) instead:

d = {'slot1': '00:00:00', 'slot2': '00:10:00', 'slot3': '00:20:00'}
d1 = {v:k for k, v in d.items()}

times = pd.to_datetime(df['time']).dt.floor('10Min')
df['time'] = times.dt.strftime('%H:%M:%S')
df['time1'] = times.add(pd.Timedelta('10Min')).dt.strftime('%H:%M:%S')
print (df)
   region        date      time  gap     time1
0       1  2016-01-01  00:00:00    1  00:10:00
1       1  2016-01-01  00:00:00    0  00:10:00
2       1  2016-01-01  00:00:00    1  00:10:00
3       1  2016-01-01  00:00:00    0  00:10:00
4       1  2016-01-01  00:10:00    0  00:20:00
5       1  2016-01-01  00:10:00    1  00:20:00
6       1  2016-01-01  00:10:00    0  00:20:00
7       1  2016-01-01  00:10:00    0  00:20:00

regres = df.groupby(['region','date','time','time1'], as_index=False)['gap'].sum()
regres['time'] = regres.pop('time1') + '/' + regres['time'].map(d1)
print (regres)
   region        date            time  gap
0       1  2016-01-01  00:10:00/slot1    2
1       1  2016-01-01  00:20:00/slot2    1

EDIT:

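An improvement over flooring and converting to strings is binning with cut or searchsorted:

import numpy as np
import pandas as pd

df['time'] = pd.to_timedelta(df['time'])

bins = pd.timedelta_range('00:00:00', '24:00:00', freq='10Min')
labels = np.array(['{}'.format(str(x)[-8:]) for x in bins])
labels = labels[:-1]

# right=False / side='right' make each bin [start, end), matching floor,
# so a time exactly on a boundary (e.g. 00:00:00) lands in the slot it starts
df['time1'] = pd.cut(df['time'], bins=bins, labels=labels, right=False)
df['time11'] = labels[np.searchsorted(bins, df['time'].values, side='right') - 1]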
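As a rough end-to-end illustration of the binning idea, here is a sketch that goes from raw times to summed gaps. It works in seconds since midnight so the bin edges are plain numbers, and it labels bins slot1..slot144; both choices are assumptions of this sketch, not the exact code above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": [1] * 8,
    "date": ["2016-01-01"] * 8,
    "time": ["00:00:08", "00:00:48", "00:02:50", "00:00:52",
             "00:10:01", "00:10:03", "00:10:05", "00:10:08"],
    "gap": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Seconds since midnight for each row
secs = pd.to_timedelta(df["time"]).dt.total_seconds()

# Bin edges every 600 s (10 min) across one day: 145 edges -> 144 bins
edges = np.arange(0, 24 * 3600 + 600, 600)
labels = [f"slot{i}" for i in range(1, len(edges))]

# right=False makes each bin [start, end), so 00:10:00 exactly falls in slot2
df["slot"] = pd.cut(secs, bins=edges, labels=labels, right=False)

# observed=True keeps only the slots that actually occur in the data
regres = (df.groupby(["region", "date", "slot"], observed=True)["gap"]
            .sum()
            .reset_index())
print(regres)
```

Passing observed=True matters because cut returns a categorical column; without it, groupby may emit a row for every one of the 144 categories.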

Upvotes: 2

ycx

Reputation: 3211

The way to approach this problem is to first convert the time column to the slot values you want, and then group by that converted column and sum gap.

The code below shows the approach. I used np.select, which accepts as many conditions and choices as you need. After converting time to the desired values, a simple groupby sum does the rest. No time formatting or string parsing is needed: the zero-padded HH:MM:SS strings already compare correctly as plain strings.

import pandas as pd
import numpy as np #This is the library you require for np.select function

#Just creating the DataFrame using a dictionary here
regdict = {
        'time': ['00:00:08','00:00:48','00:02:50','00:00:52','00:10:01','00:10:03','00:10:05','00:10:08'],
        'gap': [1,0,1,0,0,1,0,0],}

df = pd.DataFrame(regdict)

#Add in all your conditions and options here
#Conditions are checked in order; the first one that is True wins
condlist = [df['time']<'00:10:00',df['time']<'00:20:00']
choicelist = ['00:10:00/slot1','00:20:00/slot2']

#Use np.select after you have defined all your conditions and options
#A string default keeps the result dtype consistent with the string choices
#(recent NumPy versions reject mixing the numeric default 0 with strings)
answerlist = np.select(condlist, choicelist, default='no slot')
print (answerlist)
['00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1' '00:10:00/slot1'
'00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2' '00:20:00/slot2']

#Assign answerlist to df['time']
df['time'] = answerlist
print (df)
             time  gap
0  00:10:00/slot1    1
1  00:10:00/slot1    0
2  00:10:00/slot1    1
3  00:10:00/slot1    0
4  00:20:00/slot2    0
5  00:20:00/slot2    1
6  00:20:00/slot2    0
7  00:20:00/slot2    0

df = df.groupby('time', as_index=False)['gap'].sum()
print (df)
             time  gap
0  00:10:00/slot1    2
1  00:20:00/slot2    1

If you wish to keep the original time column, assign answerlist to a new column instead of overwriting df['time'], e.g. df['timeNew'] = answerlist, and aggregate from there.

df['timeNew'] = answerlist
print (df)
       time  gap         timeNew
0  00:00:08    1  00:10:00/slot1
1  00:00:48    0  00:10:00/slot1
2  00:02:50    1  00:10:00/slot1
3  00:00:52    0  00:10:00/slot1
4  00:10:01    0  00:20:00/slot2
5  00:10:03    1  00:20:00/slot2
6  00:10:05    0  00:20:00/slot2
7  00:10:08    0  00:20:00/slot2

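#Use transform here to keep every original row alongside its group total
#Passing the string 'sum' avoids the warning newer pandas raises for the builtin sum
df['aggregate sum of gap'] = df.groupby('timeNew')['gap'].transform('sum')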
print (df) 
       time  gap         timeNew  aggregate sum of gap
0  00:00:08    1  00:10:00/slot1                     2
1  00:00:48    0  00:10:00/slot1                     2
2  00:02:50    1  00:10:00/slot1                     2
3  00:00:52    0  00:10:00/slot1                     2
4  00:10:01    0  00:20:00/slot2                     1
5  00:10:03    1  00:20:00/slot2                     1
6  00:10:05    0  00:20:00/slot2                     1
7  00:10:08    0  00:20:00/slot2                     1
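With many slots, typing each condition by hand gets tedious. A possible sketch that generates condlist and choicelist in a loop, using the same string comparison as above; n_slots and the 'unassigned' default are hypothetical names chosen for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': ['00:00:08', '00:00:48', '00:02:50', '00:00:52',
             '00:10:01', '00:10:03', '00:10:05', '00:10:08'],
    'gap': [1, 0, 1, 0, 0, 1, 0, 0],
})

n_slots = 3  # hypothetical; number of 10-minute slots to cover

# Upper bound of each slot as a zero-padded HH:MM:SS string: 00:10:00, 00:20:00, ...
ends = pd.date_range('2016-01-01 00:10:00', periods=n_slots,
                     freq='10min').strftime('%H:%M:%S')

# One condition and one label per slot; np.select picks the first True condition
condlist = [df['time'] < end for end in ends]
choicelist = [f'{end}/slot{i}' for i, end in enumerate(ends, start=1)]

# Rows matching no condition get the string default
df['timeNew'] = np.select(condlist, choicelist, default='unassigned')
sums = df.groupby('timeNew')['gap'].sum()
print(sums)
```

One caveat: with n_slots = 144 the last upper bound wraps around to 00:00:00, so times from 23:50:00 onward match no condition and would need separate handling, for example through the default value.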

Upvotes: 0

Indominus

Reputation: 1248

To sidestep the complication of datetime comparison (unless that is your whole point, in which case ignore this answer) and show the essence of this group-by-slot-window problem, I assume here that the times are integers.

import numpy as np
import pandas as pd

df = pd.DataFrame({'time':[8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056], 
                   'gap': [1, 0,  1,   0,  0,    1,    0,    0,    1,    1,    1]})
slots = np.array([0, 1000, 1500])
# For each row, pick the largest slot start that is strictly below the time
df['slot'] = df.apply(func = lambda x: slots[np.argmax(slots[x['time']>slots])], axis=1)
df.groupby('slot')[['gap']].sum()

Output

       gap
slot    
-----------
0       2
1000    1
1500    3
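The row-wise apply can be slow on large frames. A vectorized sketch of the same idea with np.searchsorted, assuming slots is sorted in ascending order and no time is below the first slot start:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [8, 48, 250, 52, 1001, 1003, 1005, 1008, 2001, 2003, 2056],
                   'gap':  [1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]})
slots = np.array([0, 1000, 1500])  # slot start values, sorted ascending

# side='right' counts how many slot starts are <= time, so subtracting 1
# gives the index of the slot each time falls in
idx = np.searchsorted(slots, df['time'].to_numpy(), side='right') - 1
df['slot'] = slots[idx]

result = df.groupby('slot')[['gap']].sum()
print(result)
```

Unlike the apply version, which uses a strict comparison, this puts a time exactly equal to a slot start (e.g. 1000) into the slot that begins there.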

Upvotes: 0
