Richard

Reputation: 65510

Chart cumulative percentage by year in matplotlib?

I have some data that looks like this:

 items_by_year = { 
     2004: 10352,
     2005: 15125,
     2006: 8989 ...
 }

I'm drawing a chart of the cumulative percentage by year in matplotlib like this:

import matplotlib.pyplot as plt

# Get cumulative count. Ugh!
sum_of_items = 0  # must exist before the += below
new_dict = {}
for year in items_by_year:
    sum_of_items += items_by_year[year]
    new_dict[year] = items_by_year[year]
    for y in items_by_year:
        if y < year:
            new_dict[year] += items_by_year[y]

# Calculate cumulative percentage.
temp_data = []
for year in new_dict:
    # Keys are plain integer years; 100.0 forces float division.
    temp_data.append((year, 100.0 * new_dict[year] / sum_of_items))

# Sort array by year. 
data = sorted(temp_data, key=lambda x: x[0])
x = [date for (date, value) in data]
y = [value for (date, value) in data]

# Draw chart. 
fig = plt.figure()
graph = fig.add_subplot(111)
graph.plot(x, y)
plt.show()

I think there must be a nicer way to write this code. Any suggestions would be very gratefully received!

Upvotes: 1

Views: 4117

Answers (3)

kurtosis

Reputation: 1405

A much simpler way is to use the plt.hist() function with its cumulative and normed parameters. normed=True normalizes the counts to fractions of the total, and cumulative=1 accumulates them, which is exactly what you need. (In recent matplotlib releases, normed has been replaced by density.) The one catch: plt.hist() takes an unbinned list, in a form like

[2004, 2004, 2004, ..., 2006, 2006]

To get your data into this form, I use the following transformation (this step may be unnecessary if you already have the raw data from before you reduced it to what you posted):

items_by_year = { 
 2004: 10352,
 2005: 15125,
 2006: 8989,
 2007: 1500,
 2008: 10000
}
years = sorted(items_by_year.keys())
to_hist = []
for year in items_by_year:
    to_hist.extend([year]*items_by_year[year])
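If it helps, the same expansion can also be sketched with collections.Counter from the standard library, whose elements() method repeats each key as many times as its count. The counts are the sample values above; the sort is added here only so the result does not depend on dictionary order.

```python
from collections import Counter

items_by_year = {2004: 10352, 2005: 15125, 2006: 8989, 2007: 1500, 2008: 10000}

# elements() yields each year repeated items_by_year[year] times
to_hist = sorted(Counter(items_by_year).elements())

print(len(to_hist))             # total number of items across all years
print(to_hist[0], to_hist[-1])  # first and last year in the expanded list
```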

Once you have this data, all you need is:

# On recent matplotlib releases, where normed has been removed, pass density=True instead.
plt.hist(to_hist, cumulative=1, normed=True, bins=years+[max(years)+1])
plt.xticks([i+0.5 for i in years], years)
plt.show()
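If building the expanded list is too expensive, one alternative (a sketch, not part of the original answer) is to pass each year once and supply the counts through the weights parameter of plt.hist(). The helper names hist_inputs and plot_cumulative are made up for illustration, and density=True assumes a recent matplotlib release.

```python
def hist_inputs(items_by_year):
    """Return (years, weights, bins) describing one bin per year."""
    years = sorted(items_by_year)
    weights = [items_by_year[year] for year in years]
    bins = years + [max(years) + 1]  # extra right edge closes the last bin
    return years, weights, bins


def plot_cumulative(items_by_year):
    # Imported here so hist_inputs() works without matplotlib installed.
    import matplotlib.pyplot as plt

    years, weights, bins = hist_inputs(items_by_year)
    # Each year appears once; its count is carried by the matching weight.
    plt.hist(years, bins=bins, weights=weights, cumulative=True, density=True)
    plt.xticks([year + 0.5 for year in years], years)
    plt.show()


items_by_year = {2004: 10352, 2005: 15125, 2006: 8989, 2007: 1500, 2008: 10000}
years, weights, bins = hist_inputs(items_by_year)
# plot_cumulative(items_by_year)  # uncomment to draw the chart
```

This keeps memory use proportional to the number of years, not the number of items.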


One more thing: you can just as easily draw the reverse cumulative distribution (that is, the percentage of events from the given year onward) by passing cumulative=-1.

Upvotes: 2

Bennett Brown

Reputation: 5383

The following will minimize the time spent in the loop. Binding the list of sorted keys will save time and make your code read more clearly. There's no need for the conditional used by user1866935; you have to initialize sum_of_items anyway.

cumulative = {}
sum_of_items = 0
years = sorted(items_by_year)  # bind once; reused for the loop and the plot's x values
for year in years:
    sum_of_items += items_by_year[year]
    cumulative[year] = sum_of_items
fig, ax = plt.subplots(1, 1)
# 100.0 forces float division and scales the fraction to a percentage
ax.plot(years, [100.0 * cumulative[year] / sum_of_items for year in years])
plt.show()
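The same single-pass running total can also be sketched with itertools.accumulate from the standard library (Python 3). The function name cumulative_percentages is an illustrative choice, not something from the answers here, and the sketch assumes the dict is not empty.

```python
from itertools import accumulate


def cumulative_percentages(items_by_year):
    """Return (years, percentages): the running share of the total per year."""
    years = sorted(items_by_year)
    # Running totals in year order, computed in a single pass.
    running = list(accumulate(items_by_year[year] for year in years))
    total = running[-1]  # the last running total is the grand total
    return years, [100.0 * value / total for value in running]


items_by_year = {2004: 10352, 2005: 15125, 2006: 8989}
years, percentages = cumulative_percentages(items_by_year)
print(years)
print(percentages)
```

The two returned lists can be passed straight to ax.plot(years, percentages).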

Upvotes: 1

m_papas

Reputation: 91

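The version below is much faster at calculating the cumulative count, since it avoids looping over your dict again and again. You can measure the time using something like:

import timeit
start = timeit.default_timer()  # a clock; timeit.timeit() would time an empty statement instead
new_dict = {}
sum_of_items = 0
first = True
previous = None
years = sorted(items_by_year.keys())
for year in years:
    sum_of_items += items_by_year[year]
    if first:
        new_dict[year] = items_by_year[year]
        first = False
    else:
        # add the previous year's running total (works even if years have gaps)
        new_dict[year] = items_by_year[year] + new_dict[previous]
    previous = year
end = timeit.default_timer()
print(end - start)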

For 43 entries, your code ran on my machine in 0.000481843948364 seconds, and this version ran in 2.59876251221e-05 seconds.
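For a steadier measurement, one option is to wrap both approaches in functions and let timeit.timeit() run each of them many times. This is only a sketch: the data is made up (43 consecutive years with arbitrary counts), the function names are hypothetical, and the absolute numbers will differ from machine to machine.

```python
import timeit


def cumulative_quadratic(items_by_year):
    # For every year, re-scan the whole dict (the approach in the question).
    return {year: sum(n for y, n in items_by_year.items() if y <= year)
            for year in items_by_year}


def cumulative_linear(items_by_year):
    # One pass over the sorted years, carrying a running total.
    result, running = {}, 0
    for year in sorted(items_by_year):
        running += items_by_year[year]
        result[year] = running
    return result


# Made-up data: 43 consecutive years with arbitrary counts.
items_by_year = {1970 + i: 100 + 7 * i for i in range(43)}

# Both approaches must agree before their timings are worth comparing.
assert cumulative_quadratic(items_by_year) == cumulative_linear(items_by_year)

slow = timeit.timeit(lambda: cumulative_quadratic(items_by_year), number=200)
fast = timeit.timeit(lambda: cumulative_linear(items_by_year), number=200)
print("quadratic: %.6f s, linear: %.6f s" % (slow, fast))
```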

Hope I helped

Upvotes: 0
