Diop Chopra
Diop Chopra

Reputation: 319

Using Grouped results to create Dictionary

I am trying to summarise some data and then after filtering I would to output a dictionary based on the final results.

Please see below what I have being able to figure out thus far and an explanation of the steps I am looking for help with. Thanks in advance.

import pandas as pd
import numpy as np

data = [['New York','Female','Green',523.5,40.614256], ['New York','Female','Green',176.5,46.168092], 
['New York','Male','Green',39.5,58.970444], ['New York','Female','Purple',40.5,40.739437], 
['New York','Male','Purple',12.5,46.854553], ['New York','Male','Green',187.5,59.477795], 
['New York','Male','Green',21.5,59.283682], ['New York','Female','Green',0.5,46.772469],
['New York','Male','Green',48.5,58.847725], ['New York','Female','Purple',17.5,40.642568],
['New York','Male','Green',27.5,46.911063]]

df = pd.DataFrame(data, columns = ['city', 'gender', 'color', 'cm ', 'temp'])

#create `cuts` for `temp`
cuts = np.arange(0, 100.25, 0.25) # range for 'cuts'
df['temp_range'] = pd.cut(df.temp, cuts, include_lowest=True)

print(df)

        city  gender   color    cm        temp     temp_range
0   New York  Female   Green  523.5  40.614256  (40.5, 40.75]
1   New York  Female   Green  176.5  46.168092  (46.0, 46.25]
2   New York    Male   Green   39.5  58.970444  (58.75, 59.0]
3   New York  Female  Purple   40.5  40.739437  (40.5, 40.75]
4   New York    Male  Purple   12.5  46.854553  (46.75, 47.0]
5   New York    Male   Green  187.5  59.477795  (59.25, 59.5]
6   New York    Male   Green   21.5  59.283682  (59.25, 59.5]
7   New York  Female   Green    0.5  46.772469  (46.75, 47.0]
8   New York    Male   Green   48.5  58.847725  (58.75, 59.0]
9   New York  Female  Purple   17.5  40.642568  (40.5, 40.75]
10  New York    Male   Green   27.5  46.911063  (46.75, 47.0]

At this point is where I become stuck.

Stuck - Part 1

What I want to do next is produce a summary that outputs as follows:

        city  gender      temp_range    Green_count  Purple_count  Green_pct   Green_sum    Purple_sum     Green_max  Purple_max                                    
0   New York  Male     (46.75, 47.0]              1             1       0.50        27.5          12.5          27.5        12.5                                                                                      
1   New York  Male     (58.75, 59.0]              2           NaN       1.00        88.0           NaN          48.5         NaN                                                                                                                  
2   New York  Male     (59.25, 59.5]              2           NaN       1.00       209.0           NaN         187.5         NaN
3   New York  Female   (40.5, 40.75]              1             2       0.33       523.5          58.0         523.5        40.5                                                
4   New York  Female   (46.0, 46.25]              1           NaN       1.00       176.5           NaN         176.5         NaN                                 
5   New York  Female   (46.75, 47.0]              1           NaN       1.00         0.5           NaN           0.5         NaN

....after which I will apply a filter as follows for example:

df = df.loc[df['Green_pct'] > 0.5]
df = df.reset_index()

..which would produce a df as below after my filter:

        city  gender      temp_range    Green_count  Purple_count  Green_pct   Green_sum    Purple_sum     Green_max  Purple_max                                                                                                                      
0   New York  Male     (58.75, 59.0]              2           NaN       1.00        88.0           NaN          48.5         NaN                                                                                                                  
1   New York  Male     (59.25, 59.5]              2           NaN       1.00       209.0           NaN         187.5         NaN                                            
2   New York  Female   (46.0, 46.25]              1           NaN       1.00       176.5           NaN         176.5         NaN                                 
3   New York  Female   (46.75, 47.0]              1           NaN       1.00         0.5           NaN           0.5         NaN 

Stuck - Part 2

...and then finally, using the data from the filtered df I would like to output a dictionary which I can use later

The format of the dictionary should be as follows:

temp_dict = {('New York', Male):(58.75,59.0,59.25,59.5),
                ('New York', Female):(46.0, 46.25,46.75,47.0)}  

Upvotes: 1

Views: 46

Answers (1)

Andrej Kesely
Andrej Kesely

Reputation: 195418

Step 1

x = df.pivot_table(
    index=["city", "gender", "temp_range"],
    columns="color",
    values="cm",
    aggfunc={"cm": [lambda x: len(x), lambda x: x.sum(), lambda x: x.max()]},
)
x.columns = x.columns.set_levels(["count", "sum", "max"], level=0)
x = pd.concat(
    [
        x,
        pd.concat(
            {
                "pct": x.xs("count", axis=1).div(
                    x.xs("count", axis=1).sum(1), axis=0
                )
            },
            axis=1,
        ),
    ],
    axis=1,
)
x.columns = x.columns.map(lambda v: "_".join(v[::-1]))

print(x)

Prints:

                               Green_count  Purple_count  Green_sum  Purple_sum  Green_max  Purple_max  Green_pct  Purple_pct
city     gender temp_range                                                                                                   
New York Female (40.5, 40.75]          1.0           2.0      523.5        58.0      523.5        40.5   0.333333    0.666667
                (46.0, 46.25]          1.0           NaN      176.5         NaN      176.5         NaN   1.000000         NaN
                (46.75, 47.0]          1.0           NaN        0.5         NaN        0.5         NaN   1.000000         NaN
         Male   (46.75, 47.0]          1.0           1.0       27.5        12.5       27.5        12.5   0.500000    0.500000
                (58.75, 59.0]          2.0           NaN       88.0         NaN       48.5         NaN   1.000000         NaN
                (59.25, 59.5]          2.0           NaN      209.0         NaN      187.5         NaN   1.000000         NaN

Step 2

x = (
    x.loc[x["Green_pct"] > 0.5]
    .reset_index()
    .groupby(["city", "gender"])["temp_range"]
    .agg(lambda x: [a for v in x for a in [v.left, v.right]])
    .to_dict()
)
print(x)

Prints:

{
    ("New York", "Female"): [46.0, 46.25, 46.75, 47.0],
    ("New York", "Male"): [58.75, 59.0, 59.25, 59.5],
}

Upvotes: 1

Related Questions