Reputation: 1601
I want to create a CDF with NumPy. My code is the following:
histo = np.zeros(4096, dtype = np.int32)
for x in range(0, width):
    for y in range(0, height):
        histo[data[x][y]] += 1
q = 0
cdf = list()
for i in histo:
    q = q + i
    cdf.append(q)
I am looping over the array element by element, but the program takes a long time to run. Is there a built-in function that does this?
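As a minimal sketch, the nested loops above could be replaced by two NumPy calls. This assumes `data` is a 2-D array of non-negative integers below 4096; the random `data` here is only a stand-in for the real input, and `cdf_normalized` is a name introduced for illustration.

```python
import numpy as np

# Stand-in for the array in the question: 12-bit integer values (assumption).
rng = np.random.default_rng(0)
data = rng.integers(0, 4096, size=(64, 48))

# Count occurrences of each value 0..4095 in one call instead of nested loops.
histo = np.bincount(data.ravel(), minlength=4096)

# Running total of the counts: the unnormalized CDF.
cdf = np.cumsum(histo)

# Divide by the number of elements to get values in [0, 1].
cdf_normalized = cdf / data.size
```

`np.bincount` only works for non-negative integer data; for floating-point data, the sorting or histogram approaches in the answers are the ones that apply.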
Upvotes: 44
Views: 134020
Reputation: 85
A minor improvement to @Dan's "exact" method 2. I believe the ECDF at the first observation should not be 0 and at the last observation should be 1; ECDFs are also often visualized as step functions. (All three points are mostly irrelevant for large n.)
There was an unanswered question about duplicates. Matplotlib visualizes them well, but here is a way to remove them:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([3, 3, 3.5, 4, 6, 0, 0.5, 1, 1, 2, 2.5])
# x = np.random.normal(size = 100)
x = np.sort(x)
n = x.shape[0]
# original
y = np.arange(n)/n
plt.plot(x, y, label='original')
plt.plot(x, y, '.', color='tab:red', label='original')
# step (0, 1]
y_step = np.arange(1,n+1)/n
plt.step(x, y_step, where='post', label='step')
# no duplicates
x_unique, inds = np.unique(x, return_index=True)
y_unique = [y_[-1] for y_ in np.split(y_step, inds[1:])]
plt.step(x_unique, y_unique, '--', where='post', label='step (unique)')
plt.plot(x_unique, y_unique, '.', color='tab:brown', label='step (unique)')
plt.ylim(-0.1, 1.1)
plt.legend()
Upvotes: 1
Reputation: 41
The existing answers either resort to using a histogram, or don't handle duplicate values nicely/correctly (either ignoring duplicate values, or yielding a CDF that contains multiple y-values for the same x-value). I suggest the following method:
x, CDF_counts = np.unique(data, return_counts = True)
y = np.cumsum(CDF_counts)/np.sum(CDF_counts)
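To make the handling of duplicates concrete, here is a small worked sketch on a toy sample (the values in `data` are made up for illustration). Each distinct value gets exactly one y-value: the fraction of samples less than or equal to it.

```python
import numpy as np

data = np.array([3, 1, 2, 2, 5, 1, 2])  # toy sample with repeated values

# Sorted distinct values and how often each one occurs.
x, counts = np.unique(data, return_counts=True)

# Fraction of samples less than or equal to each distinct value.
y = np.cumsum(counts) / counts.sum()

print(x)  # [1 2 3 5]
print(y)  # cumulative fractions 2/7, 5/7, 6/7, 7/7
```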
Upvotes: 4
Reputation: 59
To complement Dan's solution: when your sample contains several identical values, you can use numpy.unique:
Z = np.array([1,1,1,2,2,4,5,6,6,6,7,8,8])
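# Z must be sorted so that each first index equals the count of smaller entries
X, F = np.unique(Z, return_index=True)
# divide by the sample size (not the number of unique values) so F stays below 1
F = F/Z.size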
plt.plot(X, F)
Upvotes: 5
Reputation: 107
I am not sure whether there is a ready-made function; the direct approach is to define one yourself:
def _cdf(x, data):
    # number of entries in data that are strictly less than x
    return np.sum(x > data)
This will be pretty fast.
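When the count is needed for many query points, a sketch using `np.searchsorted` on a sorted copy avoids rescanning the whole sample for every `x`. The name `cdf_counts` is made up for illustration; with `side='left'` it counts entries strictly less than `x`, matching the comparison above.

```python
import numpy as np

def cdf_counts(x, data):
    """For each value in x, count the entries of data strictly less than it."""
    sorted_data = np.sort(data)
    # side='left' gives the insertion index before any equal entries, which
    # equals the number of entries strictly smaller than the query value.
    return np.searchsorted(sorted_data, x, side='left')

data = np.array([1, 1, 2, 4, 6])
queries = np.array([0, 1, 2, 5, 7])
counts = cdf_counts(queries, data)
```

Dividing `counts` by `data.size` turns the counts into fractions between 0 and 1.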
Upvotes: -3
Reputation: 13373
Using a histogram is one solution, but it involves binning the data, which is not necessary for plotting the CDF of empirical data. Let F(x) be the count of how many entries are less than x; it goes up by one exactly where we see a measurement. Thus, if we sort our samples, increment the count by one at each point (or the fraction by 1/N), and plot one against the other, we will see the "exact" (i.e. un-binned) empirical CDF.
The following code sample demonstrates the method:
import numpy as np
import matplotlib.pyplot as plt
N = 100
Z = np.random.normal(size = N)
# method 1
# density=True replaces the normed argument, which newer NumPy no longer accepts
H,X1 = np.histogram( Z, bins = 10, density = True )
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)
plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()
It outputs the following
Upvotes: 108
Reputation: 1079
Update for NumPy version 1.9.0: user545424's answer does not work in 1.9.0. This works:
>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist = array([ 0.16666667, 0.15555556, 0.15555556, 0.05555556, 0.08888889,
0.08888889, 0.07777778, 0.04444444, 0.18888889, 0.08888889])
>>> hist
array([ 0.1 , 0.11111111, 0.11111111, 0.08888889, 0.08888889,
0.15555556, 0.11111111, 0.13333333, 0.1 , 0.11111111])
>>> bin_edges
array([ 0. , 0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3, 7.2, 8.1, 9. ])
>>> np.diff(bin_edges)
array([ 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9])
>>> np.diff(bin_edges)*hist
array([ 0.09, 0.1 , 0.1 , 0.08, 0.08, 0.14, 0.1 , 0.12, 0.09, 0.1 ])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.15, 0.29, 0.43, 0.48, 0.56, 0.64, 0.71, 0.75, 0.92, 1. ])
>>>
Upvotes: 7
Reputation: 16179
I'm not really sure what your code is doing, but if you have the hist and bin_edges arrays returned by numpy.histogram, you can use numpy.cumsum to generate a cumulative sum of the histogram contents.
>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), normed=True)
>>> bin_edges
array([ 0. , 0.9, 1.8, 2.7, 3.6, 4.5, 5.4, 6.3, 7.2, 8.1, 9. ])
>>> hist
array([ 0.14444444, 0.11111111, 0.11111111, 0.1 , 0.1 ,
0.14444444, 0.14444444, 0.08888889, 0.03333333, 0.13333333])
>>> np.cumsum(hist)
array([ 0.14444444, 0.25555556, 0.36666667, 0.46666667, 0.56666667,
0.71111111, 0.85555556, 0.94444444, 0.97777778, 1.11111111])
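The last value above is 1.11 rather than 1 because the normalized bin heights are densities, which only sum to 1 after being multiplied by the bin width (0.9 here). One way to sidestep this, sketched below under the assumption that raw counts are acceptable, is to skip the normalization and divide the cumulative counts by the sample size. The seeded `arr` is a stand-in sample.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=100)  # stand-in sample of integers 0..9

# Raw counts per bin (no normalization requested).
counts, bin_edges = np.histogram(arr, bins=10)

# Cumulative counts divided by the sample size give a CDF that ends at 1.
cdf = np.cumsum(counts) / counts.sum()
```

Each entry of `cdf` is the fraction of samples at or below the right edge of the corresponding bin, so `bin_edges[1:]` is the natural x-axis when plotting it.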
Upvotes: 27