How to groupby and then write the result to csv (and more)

Question

My data looks like this:

BOL,StopSequence,TimeArrived
5076223,1,12:52:56 PM
5076223,1,12:52:56 PM
5076223,2,9:50:58 AM
5076223,3,11:00:32 AM
5076223,4,11:00:52 AM
5077138,1,5:00:45 AM
5077138,2,1:43:13 PM
5077138,3,12:29:39 PM
5077138,4,1:02:31 PM
5077138,4,1:02:31 PM
5077138,5,1:02:50 PM
5077138,5,1:02:50 PM
5077138,5,1:02:50 PM
5077138,5,1:02:50 PM

As you can see, BOL is my index. There are multiple events happening for one BOL. Sometimes, the event repeats itself (notice line 2 and 3 StopSequence is 1).

I need to do a few things:

Group the dataframe by BOL
If the events repeat itself inside the BOL group, remove - I want only one
Order the events in the BOL group by TimeArrived
Add a new column called "TotalTimeArrived", which is the SUM of the TimeArrived. I don't mind if the value repeats itself for every event inside the BOL group but happy to hear suggestions

So what accomplish, this is what I have done so far:

df = pd.read_csv('./data/simple.csv', skipinitialspace=True)

# Pre Processing Stage

# Turn TimeArrived feature into timedelta
df['TimeArrived'] = pd.to_timedelta(df['TimeArrived'].str.strip())

# Group by BOL and for every group, sort by TimeArrived
df = df.sort_values(['BOL', 'TimeArrived'], ascending=True).groupby('BOL')

unique_bol_count = df.BOL.nunique().count()
print("There are {} unique BOL in this file".format(unique_bol_count))

So I was able to group the dataset and order by TimeArrived. Now I am ending up with a DataframeGroupBy and I don't know how to take it from there.

I am trying to add the TotalTimeArrived like this df['TotalTimeArrived'] = df['TimeArrived'].sum() but the error is:

TypeError: 'DataFrameGroupBy' object does not support item assignment

Finally, when trying to write the result csv with df.to_csv('./result.csv') I get

AttributeError: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method.

I also have not been able to remove the repeated event inside the BOL grouping just yet.

Any help is much appreciated, thanks

jezrael · Accepted Answer

Here is problem output of .groupby('BOL') is 'DataFrameGroupBy' object, so necessary add functions - here for new column GroupBy.transform.

Also for remove duplicates is added DataFrame.drop_duplicates.

df['TimeArrived'] = pd.to_timedelta(df['TimeArrived'].str.strip())

df = (df.drop_duplicates(['BOL','TimeArrived'])
        .sort_values(['BOL', 'TimeArrived'], ascending=True))


unique_bol_count = df.BOL.nunique()
print("There are {} unique BOL in this file".format(unique_bol_count))
There are 2 unique BOL in this file

df['TotalTimeArrived'] = df.groupby('BOL')['TimeArrived'].transform('sum')

print (df)
        BOL  StopSequence TimeArrived TotalTimeArrived
2   5076223             2    09:50:58  1 days 20:45:18
3   5076223             3    11:00:32  1 days 20:45:18
4   5076223             4    11:00:52  1 days 20:45:18
0   5076223             1    12:52:56  1 days 20:45:18
8   5077138             4    01:02:31  0 days 21:18:58
10  5077138             5    01:02:50  0 days 21:18:58
6   5077138             2    01:43:13  0 days 21:18:58
5   5077138             1    05:00:45  0 days 21:18:58
7   5077138             3    12:29:39  0 days 21:18:58

df.to_csv('./result.csv', index=False)

How to groupby and then write the result to csv (and more)

Answers (1)

Related Questions