Reputation: 65510
I have some data that looks like this:
items_by_year = {
    2004: 10352,
    2005: 15125,
    2006: 8989 ...
}
I'm drawing a chart of the cumulative percentage by year in matplotlib like this:
import matplotlib.pyplot as plt

# Get cumulative count. Ugh!
sum_of_items = 0
new_dict = {}
for year in items_by_year:
    sum_of_items += items_by_year[year]
    new_dict[year] = items_by_year[year]
    for y in items_by_year:
        if y < year:
            new_dict[year] += items_by_year[y]
# Calculate cumulative percentage.
temp_data = []
for year in new_dict:
    # Keys are plain ints in the sample above; use year.year if yours are dates.
    # 100.0 forces float division, which matters on Python 2.
    temp_data.append((year, 100.0 * new_dict[year] / sum_of_items))
# Sort array by year.
data = sorted(temp_data, key=lambda x: x[0])
x = [date for (date, value) in data]
y = [value for (date, value) in data]
# Draw chart.
fig = plt.figure()
graph = fig.add_subplot(111)
graph.plot(x, y)
plt.show()
I think there must be a way to make this code nicer; any suggestions would be very gratefully received!
Upvotes: 1
Views: 4117
Reputation: 1405
A much simpler way is to just use the plt.hist() function with its cumulative and normed parameters. normed=True normalizes the counts to fractions of the total (so the cumulative curve ends at 1), and cumulative=1 means exactly what you need. The only catch: plt.hist() takes an unbinned list, in a form like
[2004, 2004, 2004, ..., 2006, 2006]
So to get your data into this form, I use the following transformation (this step may be unnecessary if you still have the raw data from before you reduced it to what you posted):
items_by_year = {
    2004: 10352,
    2005: 15125,
    2006: 8989,
    2007: 1500,
    2008: 10000
}
years = sorted(items_by_year.keys())
to_hist = []
for year in items_by_year:
    to_hist.extend([year] * items_by_year[year])
Once you have this data, all you need is:
import matplotlib.pyplot as plt

# Newer Matplotlib releases dropped normed; pass density=True there instead.
plt.hist(to_hist, cumulative=1, normed=True, bins=years+[max(years)+1])
plt.xticks([i+0.5 for i in years], years)
plt.show()
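The expanded list can get large when the counts are big. A minimal sketch of an alternative, assuming a Matplotlib recent enough to accept density=True: pass each year once and hand the counts to the weights parameter of hist. The helper names hist_inputs and plot_cumulative_hist are made up for this illustration.

```python
def hist_inputs(items_by_year):
    """Return (years, weights, bin_edges) for a weighted histogram."""
    years = sorted(items_by_year)
    weights = [items_by_year[year] for year in years]
    # One bin per year: an edge at each year plus one closing edge.
    bin_edges = years + [max(years) + 1]
    return years, weights, bin_edges


def plot_cumulative_hist(items_by_year):
    # Imported here so hist_inputs() is usable without Matplotlib installed.
    import matplotlib.pyplot as plt

    years, weights, bin_edges = hist_inputs(items_by_year)
    fig, ax = plt.subplots()
    # Each year appears once; weights says how many items it stands for.
    ax.hist(years, bins=bin_edges, weights=weights,
            cumulative=True, density=True)
    ax.set_xticks([year + 0.5 for year in years])
    ax.set_xticklabels(years)
    return fig


items_by_year = {2004: 10352, 2005: 15125, 2006: 8989, 2007: 1500, 2008: 10000}
years, weights, bin_edges = hist_inputs(items_by_year)
```

Calling plot_cumulative_hist(items_by_year) and then plt.show() should draw the same kind of chart without ever building the to_hist list.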
One more addition: you can just as well draw the reverse cumulative distribution (that is, the percentage of events after the given year) simply by passing cumulative=-1.
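To sanity-check what the reversed plot should show, the same shares can be computed by hand. A small sketch, where reverse_cumulative_share is a name invented here; it counts each year together with all later years, which is how I read the reversed histogram (its first bin is normalized to 1).

```python
def reverse_cumulative_share(items_by_year):
    """Map each year to the fraction of items counted in that year or later."""
    total = float(sum(items_by_year.values()))
    shares = {}
    running = 0
    # Walk from the latest year backwards, accumulating as we go.
    for year in sorted(items_by_year, reverse=True):
        running += items_by_year[year]
        shares[year] = running / total
    return shares


items_by_year = {2004: 10352, 2005: 15125, 2006: 8989, 2007: 1500, 2008: 10000}
shares = reverse_cumulative_share(items_by_year)
```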
Upvotes: 2
Reputation: 5383
The following will minimize the time spent in the loop. Binding the list of sorted keys will save time and make your code read more clearly. There's no need for the conditional used by user1866935; you have to initialize sum_of_items anyway.
import matplotlib.pyplot as plt

cumulative = {}
sum_of_items = 0
years = sorted(items_by_year)  # bind this to plot x values
for year in years:
    sum_of_items += items_by_year[year]
    cumulative[year] = sum_of_items
fig, ax = plt.subplots(1, 1)
# 100.0 gives a percentage and forces float division on Python 2.
ax.plot(years, [100.0 * cumulative[year] / sum_of_items for year in years])
plt.show()
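On Python 3 the running total can also come from the standard library. A sketch using itertools.accumulate, assuming a non-empty dict; the function name cumulative_percentages is only illustrative.

```python
from itertools import accumulate


def cumulative_percentages(items_by_year):
    """Return (years, percentages), with percentages running up to 100."""
    years = sorted(items_by_year)
    # accumulate yields the running sum after each year's count is added.
    running_totals = list(accumulate(items_by_year[year] for year in years))
    total = running_totals[-1]  # assumes at least one entry
    percentages = [100.0 * value / total for value in running_totals]
    return years, percentages


items_by_year = {2004: 10352, 2005: 15125, 2006: 8989}
years, percentages = cumulative_percentages(items_by_year)
```

The two returned lists can go straight into ax.plot(years, percentages).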
Upvotes: 1
Reputation: 91
The version below is much faster at calculating the cumulative counts, since you avoid looping over your dict again and again. You can measure the time using something like:
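import timeit

new_dict = {}
sum_of_items = 0
# default_timer() returns a timestamp; timeit.timeit() with no arguments
# would instead time an empty statement.
start = timeit.default_timer()
first = True
years = sorted(items_by_year.keys())
for year in years:
    sum_of_items += items_by_year[year]
    if first:
        new_dict[year] = items_by_year[year]
        first = False
    else:
        # Assumes consecutive years, so the previous key is year - 1.
        new_dict[year] = items_by_year[year] + new_dict[year - 1]
end = timeit.default_timer()
print(end - start)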
For 43 entries, your code ran on my machine in 0.000481843948364 seconds, and this version ran in 2.59876251221e-05 seconds.
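A single start/end measurement is noisy at this scale. A rough sketch of a steadier comparison, passing callables to timeit.timeit so each version runs many times; the two function names and the sample data are invented for the example, and the timings will differ from machine to machine.

```python
import timeit


def cumulative_quadratic(items_by_year):
    """Re-scan every year for each year, as in the question."""
    result = {}
    for year in items_by_year:
        result[year] = sum(count for y, count in items_by_year.items()
                           if y <= year)
    return result


def cumulative_linear(items_by_year):
    """Single pass over the sorted years, carrying a running total."""
    result = {}
    running = 0
    for year in sorted(items_by_year):
        running += items_by_year[year]
        result[year] = running
    return result


# Made-up data: 43 consecutive years with arbitrary counts.
sample = {1970 + i: 100 + 7 * i for i in range(43)}

# number=1000 repeats each call; each result is the total time in seconds.
slow = timeit.timeit(lambda: cumulative_quadratic(sample), number=1000)
fast = timeit.timeit(lambda: cumulative_linear(sample), number=1000)
print("quadratic: %.6f s, linear: %.6f s" % (slow, fast))
```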
Hope I helped
Upvotes: 0