Pandas grouping and combining rows

Question

I have a large DataFrame that looks like this:

+----------------------------------------------------------+
| Date           Category        Location       ImpactRate |
+----------------------------------------------------------+
| 2018-04-22     Outage          MT               0.05194  |
| 2018-04-22     Outage          ND               0.02552  |
| 2018-04-22     Outage          SD               0.09962  |
| 2018-04-24     Transport       TX               0.03111  |
+----------------------------------------------------------+

What I'm trying to do is create the following output:

+-----------------------------------------------------------------------------------+
|   Date           Category        ImpactRate        Break Down             |
+-----------------------------------------------------------------------------------+
| 2018-04-22     Outage          0.17708           MT (29.3%) SD (14.4%) ND (56.3%) |
| 2018-04-24    Transport       0.03111            TX (100.0%)                      |
+-----------------------------------------------------------------------------------+

First Attempt - GroupBy

grouped_df = df.groupby('Date')['ImpactRate'].apply(list).apply(pd.Series).rename(columns=df['Location'])'

This creates a DataFrame that will contain NaNs where each Location isn't present and creates additional columns that require formatting.

Second Attempt - Loop using itertuples():

r = []
for item in df.itertuples():
    temp_x = df.loc[((df['Category'] == item[2]) & (df['Date'] == item[1]))
    for i in range(temp_x.shape[0]):
        r.append(temp_x['ImpactRate'].iloc[i])

This creates one giant list of ImpactRate - which takes me back to square one.

I don't know enough to understand how to troubleshoot this. I'm guessing I should make a list within a list for each iteration but I keep going in circles. How can I achieve this output in the most pythonic way? (Please explain so that I can learn!)

jpp · Accepted Answer

This is one way using vectorised functionality revolving around groupby.transform. I have defined Breakdown series as lists of tuples, as this is the most flexible format. You can apply specific string formats if you wish.

import pandas as pd

df = pd.DataFrame({'Date': ['2018-04-22', '2018-04-22', '2018-04-22', '2018-04-24'],
                   'Category': ['Outage', 'Outage', 'Outage', 'Transport'],
                   'Location': ['MT', 'ND', 'SD', 'TX'],
                   'ImpactRate': [0.05194, 0.02552, 0.09962, 0.03111]})

# apply ratio
df['Total'] = df.groupby(['Date', 'Category'])['ImpactRate'].transform('sum')
df['ImpactRate'] /= df['Total']

# create tuple column
df['Breakdown'] = list(zip(df.Location, df.ImpactRate))

# groupby to list
df = df.groupby(['Category', 'Date', 'Total'])['Breakdown'].apply(list).reset_index()

Result:

print(df)

    Category        Date    Total  \
0     Outage  2018-04-22  0.17708   
1  Transport  2018-04-24  0.03111   

                                           Breakdown  
0  [(MT, 0.293313756494), (ND, 0.144115653942), (...  
1                                        [(TX, 1.0)]

Pandas grouping and combining rows

Answers (2)

Related Questions