Pandas group, aggregate two columns and return the earliest Start Date for one column

Question

I am trying to group by a csv file in Pandas (by one column: ID) in order to get the earliest Start Date and latest End Date. Then I am trying to group by multiple columns in order to get the SUM of a value. For each ID in the second groupedby dataframe, I want to present the dates.

I am loading a csv in order to group and aggregate data.

01) First I load the csv

def get_csv():
        #Read csv file
        df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",parse_dates=['Start Date', 'End Date'])

        return df

02) Group and aggregate the data for the columns (ID and Site)

def do_stuff():
     df = get_csv()   
     groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

which works as expected and I am getting the following (example):

03) And ideally, for the same ID I want to present the earliest date in the Start Date column and the latest one in the End Date column. The aggregation for the value works perfectly. What I want to get is the following:

I do not know how to change my current code above. I have tried this so far:

def do_stuff():
    df = get_csv()
    md = get_csv()

    minStart = md[md['A or B'].str.contains('AAAA')].groupby([md['ID']]).agg({'Start Date': 'min'})

    df['earliestStartDate'] = minStart

    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

which fails and also tried changing the above to:

def do_stuff():
    df = get_csv()
    md = get_csv()

    df['earliestStartDate'] = md.loc[ md['ID'] == df['ID'], 'Start Date'].min()

    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})

Ideally, I will just change something in the groupedBy instead of having to read the csv twice and aggregate the data twice. Is that possible? If not, what can I change to make the script work? I am trying to test random things to get more experience in Pandas and Python.

I am guessing I have to create two dataframes here. One to get the groupedby data for all the columns needed (and the SUM of the Value). A second one to get the earliest Start Date and latest End Date for each ID. Then I need to find a way to concatenate the two dataframes. Is that a good result or do you think that there is an easier way to achieve that?

UPD: My code where I have created two dataframes (not sure whether this is the right solution) is given below:

#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
md = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])


#Calculate the Clean Value
df['Clean Cost'] = (df['Value'] - df['Value2']) #.apply(lambda x: round(x,0))

#Get the min/max Dates
minMaxDates = md[md['Random'].str.contains('Y')].groupby([md['ID']]).agg({'Start Date': 'min', 'End Date': 'max'})

#Group by and aggregate (return Earliest Start Date, Latest End Date and SUM of the Values)
groupedBy = df[df['Random'].str.contains('Y')].groupby([df['ID'], df['Site'].fillna('Other')]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum', 'Value2': 'sum', 'Clean Cost': 'sum'})

and if I print the two dataframes, I am getting the following:

and

If I print the df.head(), I am getting the following:

  ID A or B Start Date   End Date  Value  Site  Value2 Random                                                 alse.

0 45221 AAAA 2017-12-30 2017-09-30 14 S111 7 Y 1 45221 AAAA 2017-01-15 2017-09-30 15 S222 7 Y 2 85293 BBBB 2017-05-12 2017-07-24 29 S111 3 Y 3 85293 AAAA 2017-03-22 2017-10-14 32 S222 4 Y 4 45221 AAAA 2017-01-15 2017-09-30 30 S222 7 Y

A link of the file is given here:LINK

jezrael · Accepted Answer

I think you need transform:

df = pd.read_csv('sampleBionic.csv')
print (df)
      ID A or B  Start Date    End Date  Value  Site  Value2 Random
0  45221   AAAA  12/30/2017  09/30/2017     14  S111       7      Y
1  45221   AAAA  01/15/2017  09/30/2017     15  S222       7      Y
2  85293   BBBB  05/12/2017  07/24/2017     29  S111       3      Y
3  85293   AAAA  03/22/2017  10/14/2017     32  S222       4      Y
4  45221   AAAA  01/15/2017  09/30/2017     30  S222       7      Y

groupedBy = (df[df['A or B'].str.contains('AAAA')]
                            .groupby([df['ID'], df['Site'].fillna('Other'),])
                            .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print (groupedBy)    
            Start Date    End Date  Value
ID    Site                               
45221 S111  12/30/2017  09/30/2017     14
      S222  01/15/2017  09/30/2017     45
85293 S222  03/22/2017  10/14/2017     32

g = groupedBy.groupby(level=0)              
groupedBy['Start Date'] = g['Start Date'].transform('min') 
groupedBy['End Date'] = g['End Date'].transform('max')
print (groupedBy)
            Start Date    End Date  Value
ID    Site                               
45221 S111  01/15/2017  09/30/2017     14
      S222  01/15/2017  09/30/2017     45
85293 S222  03/22/2017  10/14/2017     32

Pandas group, aggregate two columns and return the earliest Start Date for one column

Answers (2)

Related Questions