Reputation: 423
I have a dataframe with values:
user value
1 0
2 1
3 4
4 2
5 1
When I try to plot a histogram with density=True, it shows a pretty weird result:
df.plot(kind='hist', density=True)
I know for certain that the first bin covers almost 100% of the values, so the density in this case should be more than 0.8. But the plot shows something around 0.04.
How could that happen? Maybe I have the meaning of density wrong.
By the way, there are about 800,000 values in the dataframe, in case it's related. Here is the describe output of the dataframe:
count 795846.000000
mean 5.220350
std 20.600285
min -3.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 247.000000
Upvotes: 3
Views: 2186
Reputation: 3077
If you are interested in probability and not probability density, I think you want to use weights instead of density. Take a look at this example to see the difference:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'x':np.random.normal(loc=5, scale=10, size=80000)})
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 4))
df.plot(kind='hist', density=True, bins=np.linspace(-100, 100, 30), ax=ax0)  # left: bar areas sum to 1
df.plot(kind='hist', bins=np.linspace(-100, 100, 30), weights=np.ones(len(df))/len(df), ax=ax1)  # right: bar heights sum to 1
If you use density, the histogram is normalized so that its total area (the sum of bar height times bin width) equals 1. If instead you use weights of 1/N per sample, it is normalized so that the bar heights themselves sum to 1.
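To check the two normalizations numerically without plotting, here is a minimal sketch using numpy.histogram directly. It assumes the same bins as the example above; the seed and the variable names are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
x = rng.normal(loc=5, scale=10, size=80_000)
bins = np.linspace(-100, 100, 30)  # 30 edges, so 29 equal-width bins

# density=True: heights are probability densities
density_heights, edges = np.histogram(x, bins=bins, density=True)

# weights of 1/N per sample: heights are the fraction of samples in each bin
weights = np.ones(len(x)) / len(x)
prob_heights, _ = np.histogram(x, bins=bins, weights=weights)

widths = np.diff(edges)
area = float(np.sum(density_heights * widths))  # total area under the density bars
total_prob = float(prob_heights.sum())          # sum of the weighted bar heights

print("area under density histogram:", area)
print("sum of weighted heights:", total_prob)
print("sum of density heights:", float(density_heights.sum()))
```

Both the area and the sum of the weighted heights should come out as 1 (up to floating-point error), while the sum of the density heights generally does not, because each height still has to be multiplied by the bin width.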
Upvotes: 4
Reputation: 359
You have misunderstood the meaning of density. Refer to the documentation of numpy.histogram (I couldn't find the exact pandas one, but the mechanism is the same): https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
"Density ... If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1"
This means that the areas of the histogram bars sum to one, not their heights. In particular, you get the probability of falling in a bin by multiplying the bar's height by the width of the bin.
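Applied to the numbers in the question, this probably explains the 0.04. Assuming the default of 10 equal-width bins, the data range from -3 to 247 gives a bin width of 25, so a height of about 0.04 corresponds to a probability of about 0.04 × 25 ≈ 1. The sketch below uses a small synthetic array (not the asker's data) that only matches the reported min and max.

```python
import numpy as np

# Synthetic stand-in for the asker's column: mostly 0s and 1s,
# with the same min (-3) and max (247) as in the describe output.
values = np.concatenate([
    np.zeros(9_000),
    np.ones(900),
    np.full(98, 30.0),
    np.array([-3.0, 247.0]),
])

# 10 equal-width bins over [min, max] (assumed default binning)
heights, edges = np.histogram(values, bins=10, density=True)

bin_width = edges[1] - edges[0]          # (247 - (-3)) / 10 = 25
first_bin_prob = heights[0] * bin_width  # height times width = probability

print("bin width:", bin_width)
print("first bin height (density):", heights[0])
print("first bin probability:", first_bin_prob)
```

Here the first bar has a height just under 0.04 even though about 99% of the values fall in that bin, which is the same pattern as in the question.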
Upvotes: 3