Reputation: 46
Good morning,
In Python, I have a dictionary (called packet_size_dist) with the following values:
34 => 0.00909909009099
42 => 0.02299770023
54 => 0.578742125787
58 => 0.211278872113
62 => 0.00529947005299
66 => 0.031796820318
70 => 0.0530946905309
74 => 0.0876912308769
Notice that the sum of the values == 1.
I am attempting to generate a CDF, which I successfully do, but it looks wrong and I am wondering if I am going about generating it incorrectly. The code in question is:
sorted_p = sorted(packet_size_dist.items(), key=operator.itemgetter(0))
yvals = np.arange(len(sorted_p))/float(len(sorted_p))
plt.plot(sorted_p, yvals)
plt.show()
But the resulting graph looks like this:
This doesn't seem to match the values in the dictionary. For example, the graph suggests that a packet size of 70 occurs about 78% of the time, whereas in my dictionary it occurs about 5% of the time. I also see a faint green line towards the left of the graph, and I don't know what it is. Any ideas?
Upvotes: 0
Views: 1387
Reputation: 21663
This is NOT a direct answer to your question. However, I thought I should point out that your data arise from a discrete random variable (rather than a continuous one), so representing them with a series of connected line segments could be misleading in some contexts. A full cumulative distribution function plot might also be overkill. I offer the following simplification.
An 'x' represents truncation. A dot represents the closed end of a closed-open interval.
Here's the code. I didn't think to use np.cumsum!
import numpy as np
import pylab as pl
from matplotlib import collections as mc
p = [0.00909909009099,0.02299770023,0.578742125787,0.211278872113,0.00529947005299,0.031796820318,0.0530946905309,0.0876912308769]
cumSums = [0] + [sum(p[:i]) for i in range(1,len(p)+1)]
counts = [30,34,42,54,58,62,66,70,74,80]
lines =[[(counts[i],cumSums[i]),(counts[i+1],cumSums[i])] for i in range(-1+len(counts))]
lc = mc.LineCollection(lines, linewidths=2)
fig, ax = pl.subplots()
ax.add_collection(lc)
pl.plot([30, 80],[0, 1],'bx')
pl.plot(counts[1:-1], cumSums[1:], 'bo')
ax.autoscale()
ax.margins(0.1)
pl.show()
This is more like the plot you appear to want. (Corrected, I hope.)
Here is the code for it:
import numpy as np
import pylab as pl
from matplotlib import collections as mc
p = [0.00909909009099,0.02299770023,0.578742125787,0.211278872113,0.00529947005299,0.031796820318,0.0530946905309,0.0876912308769]
cumSums = [sum(p[:i]) for i in range(1,len(p)+1)]
counts = [34,42,54,58,62,66,70,74]
lines = [[(counts[i],cumSums[i]),(counts[i+1],cumSums[i+1])] for i in range(-1+len(p))]
lc = mc.LineCollection(lines, linewidths=2)
fig, ax = pl.subplots()
ax.add_collection(lc)
ax.autoscale()
ax.margins(0.1)
pl.show()
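As noted above, np.cumsum could replace the list comprehension that builds cumSums. A minimal sketch, assuming the same list p as in the code above (the names cum_loop and cum_np are only illustrative):

```python
import numpy as np

# Probabilities from the question, ordered by packet size.
p = [0.00909909009099, 0.02299770023, 0.578742125787, 0.211278872113,
     0.00529947005299, 0.031796820318, 0.0530946905309, 0.0876912308769]

# Running totals built by re-summing each prefix of p (as in the code above).
cum_loop = [sum(p[:i]) for i in range(1, len(p) + 1)]

# The same running totals computed in a single pass by NumPy.
cum_np = np.cumsum(p)

# The two agree up to floating-point rounding.
print(np.allclose(cum_loop, cum_np))
```

Either version can be passed to the plotting code unchanged; np.cumsum simply avoids summing every prefix from scratch.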
Upvotes: 1
Reputation: 339765
Using numpy makes everything a lot easier. First, convert your dictionary to a 2-column numpy array. You can then sort it by its first column. Finally, calculate the cumulative sum of the second column and plot it against the first.
dic = { 34 : 0.00909909009099,
42 : 0.02299770023,
54 : 0.578742125787,
58 : 0.211278872113,
62 : 0.00529947005299,
66 : 0.031796820318,
70 : 0.0530946905309,
74 : 0.0876912308769 }
import numpy as np
import matplotlib.pyplot as plt
data = np.array([[k, v] for k, v in dic.items()])  # dic.items() works on both Python 2 and 3
data = data[data[:,0].argsort()]
cdf = np.cumsum(data[:,1])
plt.plot(data[:,0], cdf)
plt.show()
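Because the packet sizes are discrete, the same cumulative sum can also be drawn as a step function rather than as connected line segments. The following is only a sketch under that assumption: the helper names discrete_cdf and plot_step_cdf are made up for illustration, and the cumulative sum uses itertools.accumulate so that the computation itself does not depend on numpy or matplotlib.

```python
from itertools import accumulate

packet_size_dist = {34: 0.00909909009099, 42: 0.02299770023,
                    54: 0.578742125787, 58: 0.211278872113,
                    62: 0.00529947005299, 66: 0.031796820318,
                    70: 0.0530946905309, 74: 0.0876912308769}

def discrete_cdf(dist):
    """Return (sorted values, cumulative probabilities) for a {value: probability} dict."""
    sizes = sorted(dist)
    cdf = list(accumulate(dist[s] for s in sizes))
    return sizes, cdf

def plot_step_cdf(dist):
    """Draw the CDF as a right-continuous step function (hypothetical helper)."""
    import matplotlib.pyplot as plt  # imported here so discrete_cdf works without matplotlib
    sizes, cdf = discrete_cdf(dist)
    # where="post" keeps each level constant until the next packet size is reached.
    plt.step(sizes, cdf, where="post")
    plt.xlabel("packet size")
    plt.ylabel("cumulative probability")
    plt.show()

sizes, cdf = discrete_cdf(packet_size_dist)
```

Calling plot_step_cdf(packet_size_dist) shows the figure. With this data, the value at packet size 70 is about 0.91, since it is the total probability of all sizes up to and including 70, not the 5% probability of size 70 alone.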
Upvotes: 1