omar

Reputation: 1601

How to get the cumulative distribution function with NumPy?

I want to compute a CDF with NumPy. My code is the following:

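import numpy as np

# One bin per possible value (0..4095)
histo = np.zeros(4096, dtype=np.int32)
for x in range(width):
    for y in range(height):
        histo[data[x][y]] += 1

# Running sum of the histogram gives the (unnormalized) CDF
q = 0
cdf = []
for i in histo:
    q += i
    cdf.append(q)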

I am looping over the array, but the program takes a long time to run. Is there a built-in function that does this?

Upvotes: 44

Views: 134020

Answers (7)

Jerry

Reputation: 85

A minor improvement to @Dan's "exact" method 2. I believe the eCDF of the first observation should not be 0 and that of the last one should be 1. Also, eCDFs are often drawn as step functions. (All three points matter little for large n.)

There was an unanswered question about duplicates. Matplotlib draws them fine, but here is a way to remove them:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 3, 3.5, 4, 6, 0, 0.5, 1, 1, 2, 2.5])
# x = np.random.normal(size = 100)

x = np.sort(x)
n = x.shape[0]

# original
y = np.arange(n)/n
plt.plot(x, y, label='original')
plt.plot(x, y, '.', color='tab:red', label='original')

# step (0, 1]
y_step = np.arange(1,n+1)/n
plt.step(x, y_step, where='post', label='step')

# no duplicates
x_unique, inds = np.unique(x, return_index=True)
y_unique = [y_[-1] for y_ in  np.split(y_step, inds[1:])]
plt.step(x_unique, y_unique, '--', where='post', label='step (unique)')
plt.plot(x_unique, y_unique, '.', color='tab:brown', label='step (unique)')

plt.ylim(-0.1, 1.1)
plt.legend()


Upvotes: 1

Andy

Reputation: 41

The existing answers either resort to using a histogram, or don't handle duplicate values nicely/correctly (either ignoring duplicate values, or yielding a CDF that contains multiple y-values for the same x-value). I suggest the following method:

x, CDF_counts = np.unique(data, return_counts = True)
y = np.cumsum(CDF_counts)/np.sum(CDF_counts)
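
As a quick check of this approach, here is a minimal, self-contained sketch on a small made-up sample. The function name `ecdf_unique` and the sample values are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def ecdf_unique(data):
    """Return (x, y): the sorted unique values and the fraction of samples <= each."""
    x, counts = np.unique(data, return_counts=True)
    y = np.cumsum(counts) / counts.sum()
    return x, y

# Hypothetical sample with repeated values.
x, y = ecdf_unique([1, 1, 2, 3, 3, 3])
print(x)
print(y)
```

Each unique x gets exactly one y, and the last y is 1 because the counts sum to the sample size.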

Upvotes: 4

Alex

Reputation: 59

To complement Dan's solution: when your sample contains several identical values, you can use numpy.unique:

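import numpy as np
import matplotlib.pyplot as plt

Z = np.array([1, 1, 1, 2, 2, 4, 5, 6, 6, 6, 7, 8, 8])  # must be sorted
# F: index of the first occurrence of each unique value in Z
X, F = np.unique(Z, return_index=True)
F = F / Z.size  # divide by the sample size, not by the number of unique values

plt.plot(X, F)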

Upvotes: 5

user1505725

Reputation: 107

I am not sure whether there is a ready-made function. The exact thing to do is to define a function like:

def _cdf(x, data):
    # number of samples strictly less than x (data is a NumPy array)
    return np.sum(x > data)

This will be pretty fast.
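
If the CDF has to be evaluated at many points, calling a function like `_cdf` once per point rescans the data every time. One possible alternative, offered here as a sketch and not as part of the answer above, is to sort once and let `np.searchsorted` do the counting. The name `ecdf_at`, the `side` parameter default, and the sample values are assumptions for illustration. Unlike `_cdf`, it returns a fraction, not a raw count.

```python
import numpy as np

def ecdf_at(points, data, side="right"):
    """Evaluate the empirical CDF of `data` at each value in `points`.

    side="right" counts samples <= point (the usual CDF convention);
    side="left" counts samples strictly < point, like sum(x > data).
    """
    sorted_data = np.sort(np.asarray(data))
    # For each point, searchsorted gives the number of sorted samples before it.
    counts = np.searchsorted(sorted_data, points, side=side)
    return counts / sorted_data.size

sample = [3, 1, 2, 2]       # hypothetical data
queries = [0, 2, 2.5, 10]   # hypothetical evaluation points
print(ecdf_at(queries, sample))
print(ecdf_at(queries, sample, side="left"))
```

Sorting costs O(n log n) once, and each query is then a binary search.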

Upvotes: -3

Dan

Reputation: 13373

Using a histogram is one solution, but it involves binning the data, which is not necessary for plotting the CDF of empirical data. Let F(x) be the number of entries less than x. F goes up by one exactly where we see a measurement. So if we sort the samples, increment the count by one (or the fraction by 1/N) at each sample, and plot one against the other, we get the "exact" (i.e. un-binned) empirical CDF.

The following code sample demonstrates the method:

import numpy as np
import matplotlib.pyplot as plt

N = 100
Z = np.random.normal(size = N)
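# method 1: binned CDF from a normalized histogram
# (density replaces the older normed argument)
H, X1 = np.histogram(Z, bins=10, density=True)
dx = X1[1] - X1[0]
F1 = np.cumsum(H) * dx
# method 2: un-binned empirical CDF
X2 = np.sort(Z)
F2 = np.arange(N) / float(N)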

plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()

It plots both estimates on the same axes: the binned CDF from method 1 and the un-binned empirical CDF from method 2.

Upvotes: 108

offwhitelotus

Reputation: 1079

Update for NumPy version 1.9.0: user545424's answer does not work in 1.9.0. This works:

>>> import numpy as np
>>> arr = np.random.randint(0,10,100)
>>> hist, bin_edges = np.histogram(arr, density=True)
>>> hist
array([ 0.1       ,  0.11111111,  0.11111111,  0.08888889,  0.08888889,
    0.15555556,  0.11111111,  0.13333333,  0.1       ,  0.11111111])
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> np.diff(bin_edges)
array([ 0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9,  0.9])
>>> np.diff(bin_edges)*hist
array([ 0.09,  0.1 ,  0.1 ,  0.08,  0.08,  0.14,  0.1 ,  0.12,  0.09,  0.1 ])
>>> cdf = np.cumsum(hist*np.diff(bin_edges))
>>> cdf
array([ 0.09,  0.19,  0.29,  0.37,  0.45,  0.59,  0.69,  0.81,  0.9 ,  1.  ])

Upvotes: 7

user545424

Reputation: 16179

I'm not really sure what your code is doing, but if you have hist and bin_edges arrays returned by numpy.histogram you can use numpy.cumsum to generate a cumulative sum of the histogram contents.

>>> import numpy as np
>>> hist, bin_edges = np.histogram(np.random.randint(0,10,100), density=True)
>>> bin_edges
array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ])
>>> hist
array([ 0.14444444,  0.11111111,  0.11111111,  0.1       ,  0.1       ,
        0.14444444,  0.14444444,  0.08888889,  0.03333333,  0.13333333])
>>> # hist is a density, so weight each bin by its width before summing
>>> np.cumsum(hist * np.diff(bin_edges))
array([ 0.13,  0.23,  0.33,  0.42,  0.51,  0.64,  0.77,  0.85,  0.88,  1.  ])
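
For integer-valued data like the 0..4095 values in the question, the same histogram-then-cumsum idea can be sketched without any Python loops by using `np.bincount`. This is a minimal sketch under stated assumptions: the random `data` array and its shape are hypothetical stand-ins for the asker's image, and `cdf_normalized` is a name introduced here for illustration.

```python
import numpy as np

# Hypothetical stand-in for the question's `data`: a 2-D array of
# non-negative integers in the range [0, 4096).
rng = np.random.default_rng(0)
data = rng.integers(0, 4096, size=(64, 48))

# Count occurrences of each value; minlength pads the result to 4096 bins.
histo = np.bincount(data.ravel(), minlength=4096)

# Running total of the counts, i.e. the unnormalized CDF.
cdf = np.cumsum(histo)

# Divide by the number of samples to get values in [0, 1].
cdf_normalized = cdf / data.size
```

`np.bincount` needs a one-dimensional array of non-negative integers, which is why the data is flattened with `ravel()` first.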

Upvotes: 27
