Brian
Brian

Reputation: 2286

Group by/sum by across multiple rows based on conditions Python

The 'Name' column contains a persons name in one row, followed by the numbered task they need to perform (with a small description), all tasks are associated with that persons name until another persons name appears (So Tom has all the tasks below his name associated with Tom until Jim appears, then Jim is associated with all the tasks following his name, until the next name...and so on). So here is a sample of the data I have:

Name                       Three Digit Task    Number of Days

Tom                        BLANK               0.00
1.1.6.1 Task Description   1.1.6               9.00
1.1.6.2 Task Description   1.1.6               8.25
1.1.1.4 Task Description   1.1.1               13.25
Jim                        BLANK               0.00
1.1.3.1 Task Description   1.1.3               8.75
1.2.1.1 Task Description   1.2.1               6.00
1.2.1.2 Task Description   1.2.1               12.75

So I would like to get a sum of the number of days for each person, grouped by the Three Digit Task. Would like it to look something like this:

Tom      1.1.1     13.25
Tom      1.1.6     17.25
Jim      1.1.3     8.75
Jim      1.2.1     18.75

So I have tried using:

import string
ALPHA = string.ascii_letters
df['Name'].str.startswith(tuple(ALPHA))

This returns a true/false if the 'Name' column starts with a letter or not (True for a letter, false otherwise). Was trying to say something like: in between the true (which would be peoples names) sum the 'Number of Days' by grouping on the 'Three Digit Task'

Upvotes: 2

Views: 602

Answers (1)

piRSquared
piRSquared

Reputation: 294258

tl; dr

name_bool = df.Name.str.match('^[a-zA-Z]')
grp_keys = name_bool.cumsum()
grps = df.groupby(grp_keys)
tdt = 'Three Digit Task'
nod = 'Number of Days'

funcs = {'Name': 'first', nod: 'sum'}
dicts = {g.iloc[0, 0]: g.tail(-1).groupby(tdt).agg(funcs) for _, g in grps}
pd.concat(dicts)

enter image description here


explanation
use regex to find which rows have a Name column that starts with a letter.

name_bool = df.Name.str.match('^[a-zA-Z]')
name_bool  

0     True
1    False
2    False
3    False
4     True
5    False
6    False
7    False
Name: Name, dtype: bool

use cumsum to create a unique number for each contigous set of rows following a Name

grp_keys = name_bool.cumsum()
grp_keys

0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: Name, dtype: int64

create pandas groupby object
grps = df.groupby(grp_keys)

use agg and pd.concat to created final pd.DataFrame

funcs = {'Name': 'first', nod: 'sum'}
dicts = {g.iloc[0, 0]: g.tail(-1).groupby(tdt).agg(funcs) for _, g in grps}
pd.concat(dicts)

enter image description here

Upvotes: 1

Related Questions