Reputation: 319
I am trying to summarise some data and then after filtering I would to output a dictionary based on the final results.
Please see below what I have being able to figure out thus far and an explanation of the steps I am looking for help with. Thanks in advance.
import pandas as pd
import numpy as np
data = [['New York','Female','Green',523.5,40.614256], ['New York','Female','Green',176.5,46.168092],
['New York','Male','Green',39.5,58.970444], ['New York','Female','Purple',40.5,40.739437],
['New York','Male','Purple',12.5,46.854553], ['New York','Male','Green',187.5,59.477795],
['New York','Male','Green',21.5,59.283682], ['New York','Female','Green',0.5,46.772469],
['New York','Male','Green',48.5,58.847725], ['New York','Female','Purple',17.5,40.642568],
['New York','Male','Green',27.5,46.911063]]
df = pd.DataFrame(data, columns = ['city', 'gender', 'color', 'cm ', 'temp'])
#create `cuts` for `temp`
cuts = np.arange(0, 100.25, 0.25) # range for 'cuts'
df['temp_range'] = pd.cut(df.temp, cuts, include_lowest=True)
print(df)
city gender color cm temp temp_range
0 New York Female Green 523.5 40.614256 (40.5, 40.75]
1 New York Female Green 176.5 46.168092 (46.0, 46.25]
2 New York Male Green 39.5 58.970444 (58.75, 59.0]
3 New York Female Purple 40.5 40.739437 (40.5, 40.75]
4 New York Male Purple 12.5 46.854553 (46.75, 47.0]
5 New York Male Green 187.5 59.477795 (59.25, 59.5]
6 New York Male Green 21.5 59.283682 (59.25, 59.5]
7 New York Female Green 0.5 46.772469 (46.75, 47.0]
8 New York Male Green 48.5 58.847725 (58.75, 59.0]
9 New York Female Purple 17.5 40.642568 (40.5, 40.75]
10 New York Male Green 27.5 46.911063 (46.75, 47.0]
At this point is where I become stuck.
Stuck - Part 1
What I want to do next is produce a summary that outputs as follows:
city gender temp_range Green_count Purple_count Green_pct Green_sum Purple_sum Green_max Purple_max
0 New York Male (46.75, 47.0] 1 1 0.50 27.5 12.5 27.5 12.5
1 New York Male (58.75, 59.0] 2 NaN 1.00 88.0 NaN 48.5 NaN
2 New York Male (59.25, 59.5] 2 NaN 1.00 209.0 NaN 187.5 NaN
3 New York Female (40.5, 40.75] 1 2 0.33 523.5 58.0 523.5 40.5
4 New York Female (46.0, 46.25] 1 NaN 1.00 176.5 NaN 176.5 NaN
5 New York Female (46.75, 47.0] 1 NaN 1.00 0.5 NaN 0.5 NaN
....after which I will apply a filter as follows for example:
df = df.loc[df['Green_pct'] > 0.5]
df = df.reset_index()
..which would produce a df
as below after my filter:
city gender temp_range Green_count Purple_count Green_pct Green_sum Purple_sum Green_max Purple_max
0 New York Male (58.75, 59.0] 2 NaN 1.00 88.0 NaN 48.5 NaN
1 New York Male (59.25, 59.5] 2 NaN 1.00 209.0 NaN 187.5 NaN
2 New York Female (46.0, 46.25] 1 NaN 1.00 176.5 NaN 176.5 NaN
3 New York Female (46.75, 47.0] 1 NaN 1.00 0.5 NaN 0.5 NaN
Stuck - Part 2
...and then finally, using the data from the filtered df
I would like to output a dictionary which I can use later
The format of the dictionary should be as follows:
temp_dict = {('New York', Male):(58.75,59.0,59.25,59.5),
('New York', Female):(46.0, 46.25,46.75,47.0)}
Upvotes: 1
Views: 46
Reputation: 195418
x = df.pivot_table(
index=["city", "gender", "temp_range"],
columns="color",
values="cm",
aggfunc={"cm": [lambda x: len(x), lambda x: x.sum(), lambda x: x.max()]},
)
x.columns = x.columns.set_levels(["count", "sum", "max"], level=0)
x = pd.concat(
[
x,
pd.concat(
{
"pct": x.xs("count", axis=1).div(
x.xs("count", axis=1).sum(1), axis=0
)
},
axis=1,
),
],
axis=1,
)
x.columns = x.columns.map(lambda v: "_".join(v[::-1]))
print(x)
Prints:
Green_count Purple_count Green_sum Purple_sum Green_max Purple_max Green_pct Purple_pct
city gender temp_range
New York Female (40.5, 40.75] 1.0 2.0 523.5 58.0 523.5 40.5 0.333333 0.666667
(46.0, 46.25] 1.0 NaN 176.5 NaN 176.5 NaN 1.000000 NaN
(46.75, 47.0] 1.0 NaN 0.5 NaN 0.5 NaN 1.000000 NaN
Male (46.75, 47.0] 1.0 1.0 27.5 12.5 27.5 12.5 0.500000 0.500000
(58.75, 59.0] 2.0 NaN 88.0 NaN 48.5 NaN 1.000000 NaN
(59.25, 59.5] 2.0 NaN 209.0 NaN 187.5 NaN 1.000000 NaN
x = (
x.loc[x["Green_pct"] > 0.5]
.reset_index()
.groupby(["city", "gender"])["temp_range"]
.agg(lambda x: [a for v in x for a in [v.left, v.right]])
.to_dict()
)
print(x)
Prints:
{
("New York", "Female"): [46.0, 46.25, 46.75, 47.0],
("New York", "Male"): [58.75, 59.0, 59.25, 59.5],
}
Upvotes: 1