Reputation: 2876
I am trying to group by a csv file in Pandas (by one column: ID) in order to get the earliest Start Date and latest End Date. Then I am trying to group by multiple columns in order to get the SUM of a value. For each ID in the second groupedby dataframe, I want to present the dates.
I am loading a csv in order to group and aggregate data.
01) First I load the csv
def get_csv():
#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",parse_dates=['Start Date', 'End Date'])
return df
02) Group and aggregate the data for the columns (ID and Site)
def do_stuff():
df = get_csv()
groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which works as expected and I am getting the following (example):
03) And ideally, for the same ID I want to present the earliest date in the Start Date column and the latest one in the End Date column. The aggregation for the value works perfectly. What I want to get is the following:
I do not know how to change my current code above. I have tried this so far:
def do_stuff():
df = get_csv()
md = get_csv()
minStart = md[md['A or B'].str.contains('AAAA')].groupby([md['ID']]).agg({'Start Date': 'min'})
df['earliestStartDate'] = minStart
groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which fails and also tried changing the above to:
def do_stuff():
df = get_csv()
md = get_csv()
df['earliestStartDate'] = md.loc[ md['ID'] == df['ID'], 'Start Date'].min()
groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
Ideally, I will just change something in the groupedBy instead of having to read the csv twice and aggregate the data twice. Is that possible? If not, what can I change to make the script work? I am trying to test random things to get more experience in Pandas and Python.
I am guessing I have to create two dataframes
here. One to get the groupedby
data for all the columns needed (and the SUM of the Value). A second one to get the earliest Start Date and latest End Date for each ID. Then I need to find a way to concatenate the two dataframes
. Is that a good result or do you think that there is an easier way to achieve that?
UPD: My code where I have created two dataframes (not sure whether this is the right solution) is given below:
#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
md = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
#Calculate the Clean Value
df['Clean Cost'] = (df['Value'] - df['Value2']) #.apply(lambda x: round(x,0))
#Get the min/max Dates
minMaxDates = md[md['Random'].str.contains('Y')].groupby([md['ID']]).agg({'Start Date': 'min', 'End Date': 'max'})
#Group by and aggregate (return Earliest Start Date, Latest End Date and SUM of the Values)
groupedBy = df[df['Random'].str.contains('Y')].groupby([df['ID'], df['Site'].fillna('Other')]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum', 'Value2': 'sum', 'Clean Cost': 'sum'})
and if I print the two dataframes, I am getting the following:
and
If I print the df.head(), I am getting the following:
ID A or B Start Date End Date Value Site Value2 Random alse.
0 45221 AAAA 2017-12-30 2017-09-30 14 S111 7 Y 1 45221 AAAA 2017-01-15 2017-09-30 15 S222 7 Y 2 85293 BBBB 2017-05-12 2017-07-24 29 S111 3 Y 3 85293 AAAA 2017-03-22 2017-10-14 32 S222 4 Y 4 45221 AAAA 2017-01-15 2017-09-30 30 S222 7 Y
A link of the file is given here:LINK
Upvotes: 1
Views: 921
Reputation: 2876
I have managed to create a script that does what I want. I will paste the answer in case somebody needs it in the future. Jezrael's answer worked fine too. So, considering that the original csv is like this:
my sript is:
import pandas as pd
import os
import csv
import time
import dateutil.parser as dparser
import datetime
def get_csv():
#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
df = df[df['A or B'].str.contains('AAAA')]
return df
def do_stuff():
df = get_csv()
#Get the min Start Date, max End date, sum of the Value and Value2 and calculate the Net Cost
varA = 'ID';
dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'}).copy();
varsToKeep = ['ID', 'Site', 'Random', 'Start Date_grp', 'End Date_grp', 'Value', 'Value2', ];
dfTemp = pd.merge(df, dfGrouped, how='inner', on='ID', suffixes=(' ', '_grp'), copy=True)[varsToKeep];
dfBreakDown = dfTemp.groupby(['ID', 'Site', 'Random', 'Start Date_grp',
'End Date_grp']).sum()
#Calculate the Net Cost
dfTemp['Net Cost'] = (dfTemp['Value'] - dfTemp['Value2'])
groupedBy = dfTemp.groupby(['ID', 'Site', 'Random']).agg({'Start Date_grp': 'min', 'End Date_grp': 'max', 'Value': 'sum', 'Value2': 'sum', 'Net Cost': 'sum'})
csvoutput(groupedBy)
def csvoutput(df):
#Csv output
df.to_csv(path_or_buf='OUT.csv', sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=False, date_format=None, doublequote=True, escapechar=None, decimal='.')
if __name__ == "__main__":
# start things here
do_stuff()
Upvotes: 0
Reputation: 862691
I think you need transform
:
df = pd.read_csv('sampleBionic.csv')
print (df)
ID A or B Start Date End Date Value Site Value2 Random
0 45221 AAAA 12/30/2017 09/30/2017 14 S111 7 Y
1 45221 AAAA 01/15/2017 09/30/2017 15 S222 7 Y
2 85293 BBBB 05/12/2017 07/24/2017 29 S111 3 Y
3 85293 AAAA 03/22/2017 10/14/2017 32 S222 4 Y
4 45221 AAAA 01/15/2017 09/30/2017 30 S222 7 Y
groupedBy = (df[df['A or B'].str.contains('AAAA')]
.groupby([df['ID'], df['Site'].fillna('Other'),])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 12/30/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
g = groupedBy.groupby(level=0)
groupedBy['Start Date'] = g['Start Date'].transform('min')
groupedBy['End Date'] = g['End Date'].transform('max')
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 01/15/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
Upvotes: 2