Reputation: 3036
I have a dictionary like this:
property2region2value = {
'countryA':{
'a': 24,
'b': 56,
'c': 78
},
'countryB':{
'a': 3,
'b': 98
},
'countryC':{
'a': 121,
'b': 12121,
'c': 12989121,
'd':16171
},
'countryD':{
'a': 123,
'b': 1312,
'c': 1231
},
'countryE':{
'a': 1011,
'b': 1911
},
'countryF'{
'a': 1433,
'b': 19829,
'c': 1132,
'd':1791
}
}
And I am trying to create multiple dictionaries that each contains certain statistics about the master dictionary (min, max, std etc) about all the values for that property (a','b','c
etc) across all the countries (e.g. countryA, countryB
) etc.
So for example: {'a': min of 'a' across all countries, 'b': min of 'b' across all countries....}
might be one dictionary.
At the moment, I do a huge loop of code which a) isn't efficient and b) doesn't allow me to quickly calculate statistics e.g. using np.min()
or np.max()
as functions in numpy.
How can I use dictionary comprehension to achieve this? Current code for calculating min
and max
:
for country, property2value in property2region2value.items():
for property,value in property2value.items():
if property not in property2rangeMax:
property2rangeMax[property] = 0
if property2rangeMax[property]<value:
property2rangeMax[property]=value
if property not in property2rangeMin:
property2rangeMin[property] = 0
if property2rangeMin[property]>value:
property2rangeMin[property] = value
Upvotes: 1
Views: 3087
Reputation: 446
Undoubtedly, pandas
is a good choice for this task, while, in a plain python way, you can simply collect all values for each property in one loop.
from collections import defaultdict
xs = defaultdict(list)
for _, vs in property2region2value.items():
for k, v in vs.items():
xs[k].append(v)
# xs: defaultdict(<type 'list'>, {'a': [3, 121, 24, 1433, 123, 1011], 'c': [12989121, 78, 1132, 1231], 'b': [98, 12121, 56, 19829, 1312, 1911], 'd': [16171, 1791]})
Then you can apply statics on each item.
Upvotes: 2
Reputation: 96267
You should use pandas
for this task:
Pandas can help you accomplish what you want:
In [3]: pd.DataFrame(property2region2value)
Out[3]:
countryA countryB countryC countryD countryE countryF
a 24.0 3.0 121 123.0 1011.0 1433
b 56.0 98.0 12121 1312.0 1911.0 19829
c 78.0 NaN 12989121 1231.0 NaN 1132
d NaN NaN 16171 NaN NaN 1791
In [4]: df.apply(np.min, axis=1)
Out[4]:
a 3.0
b 56.0
c 78.0
d 1791.0
dtype: float64
In [5]: df.apply(np.mean, axis=1)
Out[5]:
a 4.525000e+02
b 5.887833e+03
c 3.247890e+06
d 8.981000e+03
dtype: float64
In [6]: mean_dict = df.apply(np.mean, axis=1).to_dict()
In [7]: mean_dict
Out[7]: {'a': 452.5, 'b': 5887.833333333333, 'c': 3247890.5, 'd': 8981.0}
Or, even more easily, you can transpose the DataFrame:
In [20]: df.T
Out[20]:
a b c d
countryA 24.0 56.0 78.0 NaN
countryB 3.0 98.0 NaN NaN
countryC 121.0 12121.0 12989121.0 16171.0
countryD 123.0 1312.0 1231.0 NaN
countryE 1011.0 1911.0 NaN NaN
countryF 1433.0 19829.0 1132.0 1791.0
In [21]: df.T.describe()
Out[21]:
a b c d
count 6.000000 6.000000 4.000000e+00 2.000000
mean 452.500000 5887.833333 3.247890e+06 8981.000000
std 612.768717 8215.770187 6.494154e+06 10168.195513
min 3.000000 56.000000 7.800000e+01 1791.000000
25% 48.250000 401.500000 8.685000e+02 5386.000000
50% 122.000000 1611.500000 1.181500e+03 8981.000000
75% 789.000000 9568.500000 3.248204e+06 12576.000000
max 1433.000000 19829.000000 1.298912e+07 16171.000000
In [22]: df.T.describe().to_dict()
Out[22]:
{'a': {'25%': 48.25,
'50%': 122.0,
'75%': 789.0,
'count': 6.0,
'max': 1433.0,
'mean': 452.5,
'min': 3.0,
'std': 612.76871656441472},
'b': {'25%': 401.5,
'50%': 1611.5,
'75%': 9568.5,
'count': 6.0,
'max': 19829.0,
'mean': 5887.833333333333,
'min': 56.0,
'std': 8215.770187065038},
'c': {'25%': 868.5,
'50%': 1181.5,
'75%': 3248203.5,
'count': 4.0,
'max': 12989121.0,
'mean': 3247890.5,
'min': 78.0,
'std': 6494153.687626767},
'd': {'25%': 5386.0,
'50%': 8981.0,
'75%': 12576.0,
'count': 2.0,
'max': 16171.0,
'mean': 8981.0,
'min': 1791.0,
'std': 10168.195513462553}}
And if you want finer control, you can pick and choose:
In [24]: df.T.describe().loc[['mean','std','min','max'],:]
Out[24]:
a b c d
mean 452.500000 5887.833333 3.247890e+06 8981.000000
std 612.768717 8215.770187 6.494154e+06 10168.195513
min 3.000000 56.000000 7.800000e+01 1791.000000
max 1433.000000 19829.000000 1.298912e+07 16171.000000
In [25]: df.T.describe().loc[['mean','std','min','max'],:].to_dict()
Out[25]:
{'a': {'max': 1433.0,
'mean': 452.5,
'min': 3.0,
'std': 612.76871656441472},
'b': {'max': 19829.0,
'mean': 5887.833333333333,
'min': 56.0,
'std': 8215.770187065038},
'c': {'max': 12989121.0,
'mean': 3247890.5,
'min': 78.0,
'std': 6494153.687626767},
'd': {'max': 16171.0,
'mean': 8981.0,
'min': 1791.0,
'std': 10168.195513462553}}
Then you can very easily achieve whatever you want:
In [8]: df.apply(np.min)
Out[8]:
countryA 24.0
countryB 3.0
countryC 121.0
countryD 123.0
countryE 1011.0
countryF 1132.0
dtype: float64
In [9]: df.apply(np.max)
Out[9]:
countryA 78.0
countryB 98.0
countryC 12989121.0
countryD 1312.0
countryE 1911.0
countryF 19829.0
dtype: float64
In [10]: df.apply(np.std)
Out[10]:
countryA 2.217105e+01
countryB 4.750000e+01
countryC 5.620356e+06
countryD 5.424170e+02
countryE 4.500000e+02
countryF 7.960893e+03
dtype: float64
You can even bring everything back to dictionaries with ease:
In [11]: df.apply(np.min).to_dict()
Out[11]:
{'countryA': 24.0,
'countryB': 3.0,
'countryC': 121.0,
'countryD': 123.0,
'countryE': 1011.0,
'countryF': 1132.0}
Go nuts! All your data-processing needs will be easier:
In [12]: df.describe()
Out[12]:
countryA countryB countryC countryD countryE \
count 3.000000 2.000000 4.000000e+00 3.000000 2.000000
mean 52.666667 50.500000 3.254384e+06 888.666667 1461.000000
std 27.153882 67.175144 6.489829e+06 664.322462 636.396103
min 24.000000 3.000000 1.210000e+02 123.000000 1011.000000
25% 40.000000 26.750000 9.121000e+03 677.000000 1236.000000
50% 56.000000 50.500000 1.414600e+04 1231.000000 1461.000000
75% 67.000000 74.250000 3.259408e+06 1271.500000 1686.000000
max 78.000000 98.000000 1.298912e+07 1312.000000 1911.000000
countryF
count 4.000000
mean 6046.250000
std 9192.447602
min 1132.000000
25% 1357.750000
50% 1612.000000
75% 6300.500000
max 19829.000000
Upvotes: 3