Aniruddh Sharma

Reputation: 51

What does the parameter "bins" signify in dataframe.hist()?

I'm learning ML from a book in which the author draws a histogram of the data with housing.hist(bins=50, figsize=(20,15)) followed by plt.show(). I didn't understand the significance of the bins parameter, why it is needed, or how to decide on a value for it.

I went to the pandas documentation website (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html) and still did not understand what the parameter "bins" means.

Upvotes: 2

Views: 6281

Answers (1)

ImportanceOfBeingErnest

Reputation: 339170

Simple answer: bins should be the number of bars you want to show in your histogram plot.

But let's unwrap the chain: the pandas hist function calls matplotlib's hist function. In contrast to pandas, matplotlib has a verbose docstring:

bins : integer or sequence or ‘auto’, optional

If an integer is given, bins + 1 bin edges are calculated and returned, consistent with numpy.histogram().

If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.

All but the last (righthand-most) bin is half-open. In other words, if bins is:

[1, 2, 3, 4] then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.

Unequally spaced bins are supported if bins is a sequence.
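To make the docstring concrete, here is a minimal sketch using numpy.histogram directly, which the docstring says matplotlib is consistent with. The data values are made up for illustration.

```python
import numpy as np

data = [1, 2, 2, 3, 4]  # made-up sample values

# bins as a sequence: these are the bin edges, giving [1, 2), [2, 3), [3, 4]
counts, edges = np.histogram(data, bins=[1, 2, 3, 4])
print(counts)  # one count per bin; the value 4 lands in the closed last bin
print(edges)   # the edges come back as given

# bins as an integer: numpy computes bins + 1 equally spaced edges
int_counts, int_edges = np.histogram(data, bins=4)
print(len(int_counts), len(int_edges))
```

With a sequence you control exactly where each bin starts and ends; with an integer you only choose how many bins there are.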

Note that by default, the numpy default of 10 bins between the minimum and maximum data point is used. This means the data range is divided into 10 equally sized intervals, and each value is assigned to exactly one of those 10 bins, incrementing that bin's count. The count is then shown as the height of the respective bar in the plot.

Changing the value of bins to some other number allows you to have more or fewer of those intervals.
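As a rough sketch of this with pandas itself: the DataFrame below is a hypothetical stand-in for the book's housing data, filled with random numbers, and the Agg backend is chosen only so the snippet runs without a display. Counting the bar patches on each axes shows how bins maps to the number of bars.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so no window is needed
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in data, not the book's housing dataset
df = pd.DataFrame({"value": rng.normal(size=1000)})

# DataFrame.hist returns an array of axes, one per numeric column
ax_default = np.ravel(df.hist())[0]       # bins left at its default of 10
ax_fine = np.ravel(df.hist(bins=50))[0]   # 50 narrower intervals

n_default = len(ax_default.patches)  # each bar is one rectangle patch
n_fine = len(ax_fine.patches)
total_height = sum(p.get_height() for p in ax_fine.patches)
print(n_default, n_fine, total_height)
```

The bar heights always add up to the number of rows, because every value is counted in exactly one bin regardless of how many bins there are.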

Also, looking at Wikipedia may help:

There is no "best" number of bins, and different bin sizes can reveal different features of the data. [...]

Using wider bins where the density of the underlying data points is low reduces noise due to sampling randomness; using narrower bins where the density is high (so the signal drowns the noise) gives greater precision to the density estimation.

In this case, "wider bins" means a lower number for bins, and "narrower bins" translates to a larger number for bins.
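The relationship between the number of bins and their width can be checked with a small sketch. It assumes equally spaced bins over made-up data spanning 0 to 100, so each width is simply the data range divided by the number of bins.

```python
import numpy as np

data = np.linspace(0.0, 100.0, 1001)  # made-up values covering the range 0..100

widths = {}
for n_bins in (5, 10, 50):
    counts, edges = np.histogram(data, bins=n_bins)
    # With an integer bins value all bins are equally wide
    widths[n_bins] = edges[1] - edges[0]
    print(n_bins, "bins -> width", widths[n_bins])
```

In practice it is common to try a few values and keep the one whose plot shows the shape of the data without being dominated by noise.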

Upvotes: 4
