riversxiao

Reputation: 369

matplotlib hist function argument density not working

plt.hist's density argument does not work.

I tried to use the density argument in the plt.hist function to normalize stock returns in my plot, but it didn't work.

The following code worked fine for me and gave me the probability density function I wanted.

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 100  # mean of distribution
sigma = 15  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

num_bins = 50

plt.hist(x, num_bins, density=1)

plt.show()

[Image: plot shows density]

But when I tried it with stock data, it didn't work: the plot showed unnormalized values. I didn't find anything abnormal in my data array.

import numpy as np
import matplotlib.pyplot as plt

# "returns" is a np array consisting of 360 days of stock returns
fig = plt.figure()
plt.hist(returns, 50, density=True)
plt.show()

[Image: density not working]

Upvotes: 20

Views: 17418

Answers (5)

Aritra Mandal

Reputation: 113

At first I also thought this was a bug. I expected the tick values on the y-axis never to exceed 1, since a value above 1 would seem to mean that the frequency in that bin is greater than the total frequency, which makes no sense.

After thinking about it for a while, I understood what is really happening. What we expect it to return is the probability of each bin, i.e. the relative frequency: (observed frequency of a bin) / (total frequency).

But what Matplotlib returns as density is (observed frequency of a bin) / (total frequency * width of that bin). If the bin width is much smaller than 1, the density for that bin can exceed 1. The total area under the histogram is still 1, because sum(density * bin_width) over all bins = sum(each frequency) / (total frequency) = 1.

So the values you are getting are absolutely fine and make sense too.
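The relationship described above can be checked numerically. This is a minimal sketch using np.histogram, which plt.hist uses for its binning; the synthetic array x is only an assumed stand-in for 360 small daily returns, not the asker's data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for 360 daily returns (values on the order of 0.01)
x = 0.01 * rng.standard_normal(360)

counts, edges = np.histogram(x, bins=50)             # raw counts per bin
density, _ = np.histogram(x, bins=50, density=True)  # what density=True reports
widths = np.diff(edges)                              # width of each bin

# density = count / (total count * bin width)
manual_density = counts / (counts.sum() * widths)

# The area under the bars, not the sum of the bar heights, is 1
area = np.sum(density * widths)

print(np.allclose(density, manual_density))
print(area)
print(density.max())
```

Because the bins here are far narrower than 1, the largest density value comes out well above 1 even though the area is still 1.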

Upvotes: 1

Marco Wedemeyer

Reputation: 366

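Another approach, besides that of tvbc, is to keep density=True and relabel the y-ticks so that they show density times bin width, i.e. the fraction of samples in each bin. This assumes all bins have the same width.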

import matplotlib.pyplot as plt
import numpy as np

steps = 10  # width of each bin
bins = np.arange(0, 101, steps)
data = np.random.random(100000) * 100

plt.hist(data, bins=bins, density=True)
# Relabel the ticks: density * bin width = fraction of samples per bin
yticks = plt.gca().get_yticks()
plt.yticks(yticks, np.round(yticks * steps, 2))
plt.show()

Upvotes: 0

tvbc

Reputation: 43

Since this isn't resolved: @user14518925's response is correct. Matplotlib uses the actual bin width when normalizing, whereas, as I understand it, you want each bin treated as if it had a width of 1, so that the bar heights themselves sum to 1. More succinctly, what you're seeing is:

\sum_{i}y_{i}\times\text{bin size} =1

Whereas what you want is:

\sum_{i}y_{i} =1

Therefore, all you really need to change is the tick labels on the y-axis. One way to do this is to disable the density option:

density=False

and instead divide by the total sample size (shown here with your example):

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19680801)

# example data
mu = 0 # mean of distribution
sigma = 0.0000625  # standard deviation of distribution
x = mu + sigma * np.random.randn(437)

fig = plt.figure()
plt.hist(x, 50, density=False)
# The tick locations are raw counts; relabel them as fractions of the sample size
locs, _ = plt.yticks()
print(locs)
plt.yticks(locs, np.round(locs / len(x), 3))
plt.show()

Upvotes: 1

user14518925

Reputation: 39

It is not a bug. The total area of the bars equals 1. The numbers only seem strange because your bin widths are small.
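To illustrate the effect of bin width, here is a rough sketch that bins the same standard-normal draws at two scales: one like the question's first example (mu = 100, sigma = 15) and one with sigma = 0.0000625, as used in tvbc's answer. The helper density_summary is a hypothetical name introduced for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(437)

wide = 100 + 15 * z        # bins end up roughly 2 units wide
narrow = 0.0000625 * z     # bins end up far narrower than 1


def density_summary(sample, bins=50):
    """Return (tallest bar, total area) of a density-normalized histogram."""
    density, edges = np.histogram(sample, bins=bins, density=True)
    return density.max(), np.sum(density * np.diff(edges))


wide_peak, wide_area = density_summary(wide)
narrow_peak, narrow_area = density_summary(narrow)

print(wide_peak, wide_area)
print(narrow_peak, narrow_area)
```

Both histograms have an area of 1, but only the narrow one has bars taller than 1, which matches what the question shows for stock returns.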

Upvotes: 3

Ethan Heilman

Reputation: 16928

This is a known issue in Matplotlib.

As stated in the bug report "The density flag in pyplot.hist() does not work correctly":

When density = False, the histogram plot would have counts on the Y-axis. But when density = True, the Y-axis does not mean anything useful. I think a better implementation would plot the PDF as the histogram when density = True.

The developers view this as a feature, not a bug, since it maintains compatibility with numpy. They have already closed several bug reports about it because it is working as intended. Adding to the confusion, the example on the matplotlib site appears to show this feature working, with the y-axis showing meaningful values.

What you want to do with matplotlib is reasonable but matplotlib will not let you do it that way.
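If the goal is bar heights that sum to 1, one commonly used workaround (not part of this answer) is to leave density off and pass per-sample weights of 1/N. The sketch below uses np.histogram so the result can be inspected without a plot; the returns array is a synthetic, assumed stand-in for the asker's data.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical stand-in for 360 days of stock returns
returns = 0.01 * rng.standard_normal(360)

# Each sample contributes 1/N, so each bar is the fraction of samples in its bin
weights = np.full(returns.shape, 1.0 / returns.size)
fractions, edges = np.histogram(returns, bins=50, weights=weights)

print(fractions.sum())
print(fractions.max())
```

plt.hist accepts the same weights argument, so plt.hist(returns, 50, weights=weights) should draw bars with these heights.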

Upvotes: 10
