Dhruv Ghulati
Dhruv Ghulati

Reputation: 3036

Dictionary comprehension to calculate statistics across dict of dicts for each key in inner dicts

I have a dictionary like this:

property2region2value = {
    'countryA':{
        'a': 24,
        'b': 56,
        'c': 78
    },
    'countryB':{
        'a': 3,
        'b': 98
    },
    'countryC':{
        'a': 121,
        'b': 12121,
        'c': 12989121,
        'd':16171
    },
    'countryD':{
        'a': 123,
        'b': 1312,
        'c': 1231
    },
    'countryE':{
        'a': 1011,
        'b': 1911
    },
    'countryF'{
        'a': 1433,
        'b': 19829,
        'c': 1132,
        'd':1791
    }
}

And I am trying to create multiple dictionaries that each contains certain statistics about the master dictionary (min, max, std etc) about all the values for that property (a','b','c etc) across all the countries (e.g. countryA, countryB) etc.

So for example: {'a': min of 'a' across all countries, 'b': min of 'b' across all countries....} might be one dictionary.

At the moment, I do a huge loop of code which a) isn't efficient and b) doesn't allow me to quickly calculate statistics e.g. using np.min() or np.max() as functions in numpy.

How can I use dictionary comprehension to achieve this? Current code for calculating min and max:

for country, property2value in property2region2value.items():
    for property,value in property2value.items():
        if property not in property2rangeMax:
            property2rangeMax[property] = 0
        if property2rangeMax[property]<value:
            property2rangeMax[property]=value
        if property not in property2rangeMin:
            property2rangeMin[property] = 0
        if property2rangeMin[property]>value:
            property2rangeMin[property] = value

Upvotes: 1

Views: 3087

Answers (2)

citaret
citaret

Reputation: 446

Undoubtedly, pandas is a good choice for this task, while, in a plain python way, you can simply collect all values for each property in one loop.

from collections import defaultdict

xs = defaultdict(list)

for _, vs in property2region2value.items():
    for k, v in vs.items():
        xs[k].append(v)

# xs: defaultdict(<type 'list'>, {'a': [3, 121, 24, 1433, 123, 1011], 'c': [12989121, 78, 1132, 1231], 'b': [98, 12121, 56, 19829, 1312, 1911], 'd': [16171, 1791]})

Then you can apply statics on each item.

Upvotes: 2

juanpa.arrivillaga
juanpa.arrivillaga

Reputation: 96267

You should use pandas for this task:

Edit

Pandas can help you accomplish what you want:

In [3]: pd.DataFrame(property2region2value)
Out[3]: 
   countryA  countryB  countryC  countryD  countryE  countryF
a      24.0       3.0       121     123.0    1011.0      1433
b      56.0      98.0     12121    1312.0    1911.0     19829
c      78.0       NaN  12989121    1231.0       NaN      1132
d       NaN       NaN     16171       NaN       NaN      1791

In [4]: df.apply(np.min, axis=1)
Out[4]: 
a       3.0
b      56.0
c      78.0
d    1791.0
dtype: float64

In [5]: df.apply(np.mean, axis=1)
Out[5]: 
a    4.525000e+02
b    5.887833e+03
c    3.247890e+06
d    8.981000e+03
dtype: float64

In [6]: mean_dict = df.apply(np.mean, axis=1).to_dict()

In [7]: mean_dict
Out[7]: {'a': 452.5, 'b': 5887.833333333333, 'c': 3247890.5, 'd': 8981.0}

Or, even more easily, you can transpose the DataFrame:

In [20]: df.T
Out[20]: 
               a        b           c        d
countryA    24.0     56.0        78.0      NaN
countryB     3.0     98.0         NaN      NaN
countryC   121.0  12121.0  12989121.0  16171.0
countryD   123.0   1312.0      1231.0      NaN
countryE  1011.0   1911.0         NaN      NaN
countryF  1433.0  19829.0      1132.0   1791.0

In [21]: df.T.describe()
Out[21]: 
                 a             b             c             d
count     6.000000      6.000000  4.000000e+00      2.000000
mean    452.500000   5887.833333  3.247890e+06   8981.000000
std     612.768717   8215.770187  6.494154e+06  10168.195513
min       3.000000     56.000000  7.800000e+01   1791.000000
25%      48.250000    401.500000  8.685000e+02   5386.000000
50%     122.000000   1611.500000  1.181500e+03   8981.000000
75%     789.000000   9568.500000  3.248204e+06  12576.000000
max    1433.000000  19829.000000  1.298912e+07  16171.000000

In [22]: df.T.describe().to_dict()
Out[22]: 
{'a': {'25%': 48.25,
  '50%': 122.0,
  '75%': 789.0,
  'count': 6.0,
  'max': 1433.0,
  'mean': 452.5,
  'min': 3.0,
  'std': 612.76871656441472},
 'b': {'25%': 401.5,
  '50%': 1611.5,
  '75%': 9568.5,
  'count': 6.0,
  'max': 19829.0,
  'mean': 5887.833333333333,
  'min': 56.0,
  'std': 8215.770187065038},
 'c': {'25%': 868.5,
  '50%': 1181.5,
  '75%': 3248203.5,
  'count': 4.0,
  'max': 12989121.0,
  'mean': 3247890.5,
  'min': 78.0,
  'std': 6494153.687626767},
 'd': {'25%': 5386.0,
  '50%': 8981.0,
  '75%': 12576.0,
  'count': 2.0,
  'max': 16171.0,
  'mean': 8981.0,
  'min': 1791.0,
  'std': 10168.195513462553}}

And if you want finer control, you can pick and choose:

In [24]: df.T.describe().loc[['mean','std','min','max'],:]
Out[24]: 
                a             b             c             d
mean   452.500000   5887.833333  3.247890e+06   8981.000000
std    612.768717   8215.770187  6.494154e+06  10168.195513
min      3.000000     56.000000  7.800000e+01   1791.000000
max   1433.000000  19829.000000  1.298912e+07  16171.000000

In [25]: df.T.describe().loc[['mean','std','min','max'],:].to_dict()
Out[25]: 
{'a': {'max': 1433.0,
       'mean': 452.5,
       'min': 3.0,
       'std': 612.76871656441472},
 'b': {'max': 19829.0,
       'mean': 5887.833333333333,
       'min': 56.0,
       'std': 8215.770187065038},
 'c': {'max': 12989121.0,
       'mean': 3247890.5,
       'min': 78.0,
       'std': 6494153.687626767},
 'd': {'max': 16171.0,
       'mean': 8981.0,
       'min': 1791.0,
       'std': 10168.195513462553}}

From Original Answer

Then you can very easily achieve whatever you want:

In [8]: df.apply(np.min)
Out[8]: 
countryA      24.0
countryB       3.0
countryC     121.0
countryD     123.0
countryE    1011.0
countryF    1132.0
dtype: float64

In [9]: df.apply(np.max)
Out[9]: 
countryA          78.0
countryB          98.0
countryC    12989121.0
countryD        1312.0
countryE        1911.0
countryF       19829.0
dtype: float64

In [10]: df.apply(np.std)
Out[10]: 
countryA    2.217105e+01
countryB    4.750000e+01
countryC    5.620356e+06
countryD    5.424170e+02
countryE    4.500000e+02
countryF    7.960893e+03
dtype: float64

You can even bring everything back to dictionaries with ease:

In [11]: df.apply(np.min).to_dict()
Out[11]: 
{'countryA': 24.0,
 'countryB': 3.0,
 'countryC': 121.0,
 'countryD': 123.0,
 'countryE': 1011.0,
 'countryF': 1132.0}

Go nuts! All your data-processing needs will be easier:

In [12]: df.describe()
Out[12]: 
        countryA   countryB      countryC     countryD     countryE  \
count   3.000000   2.000000  4.000000e+00     3.000000     2.000000   
mean   52.666667  50.500000  3.254384e+06   888.666667  1461.000000   
std    27.153882  67.175144  6.489829e+06   664.322462   636.396103   
min    24.000000   3.000000  1.210000e+02   123.000000  1011.000000   
25%    40.000000  26.750000  9.121000e+03   677.000000  1236.000000   
50%    56.000000  50.500000  1.414600e+04  1231.000000  1461.000000   
75%    67.000000  74.250000  3.259408e+06  1271.500000  1686.000000   
max    78.000000  98.000000  1.298912e+07  1312.000000  1911.000000   

           countryF  
count      4.000000  
mean    6046.250000  
std     9192.447602  
min     1132.000000  
25%     1357.750000  
50%     1612.000000  
75%     6300.500000  
max    19829.000000  

Upvotes: 3

Related Questions