Reputation: 3340
I am trying to gather summary statistics to generate a boxplot.
I have a dictionary where the keys are the observed values to be plotted on the y-axis and the values are the number of times each one occurs in the data.
d = {16: 5,
     21: 9,
     44: 2,
     2: 1}
I am wondering if there is a way to generate statistics such as the median, Q1, Q3, etc. from the counts alone. I don't want to expand the dictionary into a list like [16, 16, 16, 16, 16, 21, 21, ...] and calculate from that, because I am trying to save a considerable amount of memory by not storing the individual observations.
EDIT
To be more concrete, given the input
d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}
I would like something that outputs
{'q1': 4, 'q2': 10.5, 'q3': 15, 'max': 18, 'min': 3}
Upvotes: 3
Views: 1991
Reputation: 164813
Here is an idea. I have not dealt with every case (e.g. when the median index is not a whole number), but since get_val pulls from a generator and stops at the first match, it should be memory-efficient.
from collections import OrderedDict
from itertools import accumulate

d = {16: 5,
     21: 9,
     44: 4,
     2: 2}

# Sort by key so the running totals follow the order of the observations.
d = OrderedDict(sorted(d.items()))
size = sum(d.values())

# Positions of the quartiles among the sorted observations.
idx = {'q1': size/4,
       'q2': size/2,
       'q3': size*3/4}
# {'q1': 5.0, 'q2': 10.0, 'q3': 15.0}

def get_val(d, i):
    # First key whose cumulative count exceeds position i.
    return next(k for k, x in zip(d, accumulate(d.values())) if i < x)

res = {k: get_val(d, v) for k, v in idx.items()}
# {'q1': 16, 'q2': 21, 'q3': 21}
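One possible way to cover the cases left open above (a quartile position that falls between two observations, plus min and max) is sketched below. It assumes the "median of halves" quartile convention, which happens to reproduce the expected output in the question; other conventions give slightly different quartiles. The names summary_from_counts, at and median are illustrative, and the sketch assumes a non-empty dictionary with non-negative integer counts.

```python
from bisect import bisect_right
from itertools import accumulate


def summary_from_counts(counts):
    """Five-number summary of a {value: count} mapping without expanding it."""
    values = sorted(counts)
    # cum[j] = number of observations less than or equal to values[j]
    cum = list(accumulate(counts[v] for v in values))
    n = cum[-1]

    def at(i):
        # i-th smallest observation (0-based): the first value whose
        # running total exceeds i.
        return values[bisect_right(cum, i)]

    def median(lo, hi):
        # Median of the observations at sorted positions lo .. hi-1.
        m = hi - lo
        if m % 2:
            return at(lo + m // 2)
        # Even length: average the two middle observations.
        return (at(lo + m // 2 - 1) + at(lo + m // 2)) / 2

    half = n // 2  # size of each half; the middle item is excluded when n is odd
    return {'min': at(0),
            'q1': median(0, half),
            'q2': median(0, n),
            'q3': median(n - half, n),
            'max': at(n - 1)}


d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}
print(summary_from_counts(d))
# {'min': 3, 'q1': 4.0, 'q2': 10.5, 'q3': 15.0, 'max': 18}
```

Only the distinct values and their running totals are held in memory, so the cost grows with the number of keys rather than the number of observations. On the 20-observation dictionary used above it gives the same q1, q2 and q3 (16, 21, 21) as get_val.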
Upvotes: 2