Reputation: 2503
My data looks like this:
BOL,StopSequence,TimeArrived
5076223,1,12:52:56 PM
5076223,1,12:52:56 PM
5076223,2,9:50:58 AM
5076223,3,11:00:32 AM
5076223,4,11:00:52 AM
5077138,1,5:00:45 AM
5077138,2,1:43:13 PM
5077138,3,12:29:39 PM
5077138,4,1:02:31 PM
5077138,4,1:02:31 PM
5077138,5,1:02:50 PM
5077138,5,1:02:50 PM
5077138,5,1:02:50 PM
5077138,5,1:02:50 PM
As you can see, BOL is my index. There are multiple events happening for one BOL. Sometimes, the event repeats itself (notice line 2 and 3 StopSequence is 1).
I need to do a few things:
So what accomplish, this is what I have done so far:
df = pd.read_csv('./data/simple.csv', skipinitialspace=True)
# Pre Processing Stage
# Turn TimeArrived feature into timedelta
df['TimeArrived'] = pd.to_timedelta(df['TimeArrived'].str.strip())
# Group by BOL and for every group, sort by TimeArrived
df = df.sort_values(['BOL', 'TimeArrived'], ascending=True).groupby('BOL')
unique_bol_count = df.BOL.nunique().count()
print("There are {} unique BOL in this file".format(unique_bol_count))
So I was able to group the dataset and order by TimeArrived. Now I am ending up with a DataframeGroupBy and I don't know how to take it from there.
I am trying to add the TotalTimeArrived like this df['TotalTimeArrived'] = df['TimeArrived'].sum()
but the error is:
TypeError: 'DataFrameGroupBy' object does not support item assignment
Finally, when trying to write the result csv with df.to_csv('./result.csv')
I get
AttributeError: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method.
I also have not been able to remove the repeated event inside the BOL grouping just yet.
Any help is much appreciated, thanks
Upvotes: 1
Views: 2761
Reputation: 862751
Here is problem output of .groupby('BOL')
is 'DataFrameGroupBy' object
, so necessary add functions - here for new column GroupBy.transform
.
Also for remove duplicates is added DataFrame.drop_duplicates
.
df['TimeArrived'] = pd.to_timedelta(df['TimeArrived'].str.strip())
df = (df.drop_duplicates(['BOL','TimeArrived'])
.sort_values(['BOL', 'TimeArrived'], ascending=True))
unique_bol_count = df.BOL.nunique()
print("There are {} unique BOL in this file".format(unique_bol_count))
There are 2 unique BOL in this file
df['TotalTimeArrived'] = df.groupby('BOL')['TimeArrived'].transform('sum')
print (df)
BOL StopSequence TimeArrived TotalTimeArrived
2 5076223 2 09:50:58 1 days 20:45:18
3 5076223 3 11:00:32 1 days 20:45:18
4 5076223 4 11:00:52 1 days 20:45:18
0 5076223 1 12:52:56 1 days 20:45:18
8 5077138 4 01:02:31 0 days 21:18:58
10 5077138 5 01:02:50 0 days 21:18:58
6 5077138 2 01:43:13 0 days 21:18:58
5 5077138 1 05:00:45 0 days 21:18:58
7 5077138 3 12:29:39 0 days 21:18:58
df.to_csv('./result.csv', index=False)
Upvotes: 1