user10640719


How do I normalize a histogram using Matplotlib?

I am trying to generate a histogram using matplotlib. I am reading data from the following file: https://github.com/meghnasubramani/Files/blob/master/class_id.txt

My intent is to generate a histogram with the following bins: 1, 2-5, 5-100, 100-200, 200-1000, >1000.

The resulting graph is hard to read. I would like to normalize the y axis to (number of items in a bin / total number of items). I tried the density parameter, but whenever I use it the graph ends up completely blank. How do I go about doing this?

How do I get the widths of the bars to be the same, even though the bin ranges vary?

Is it also possible to specify the ticks on the histogram? I want to have the ticks correspond to the bin ranges.


import matplotlib.pyplot as plt

FILE_NAME = 'class_id.txt'
class_id = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [1, 2, 5, 100, 200, 1000, max(class_id)]
x = plt.hist(class_id, bins=num_bins, histtype='bar', align='mid', rwidth=0.5, color='b')
print(x)
plt.legend()
plt.xlabel('Items')
plt.ylabel('Frequency')

Upvotes: 0

Views: 2700

Answers (1)

Marsu

Reputation: 786

As suggested by importanceofbeingernest, a bar chart is a good fit for categorical data, so we first need to sort the values into bins, for example with pandas:

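import matplotlib.pyplot as plt
import pandas

FILE_NAME = 'class_id.txt'
# Read one integer per line; the with block closes the file afterwards.
with open(FILE_NAME) as f:
    class_id_file = [int(line.rstrip('\n')) for line in f]

# pandas.cut uses right-closed intervals: (0, 2], (2, 5], ..., (1000, max].
# The edges must be increasing, so this assumes max(class_id_file) > 1000.
num_bins = [0, 2, 5, 100, 200, 1000, max(class_id_file)]
categories = pandas.cut(class_id_file, num_bins)
df = pandas.DataFrame(class_id_file)
# observed=False keeps empty bins, so the counts stay aligned with the labels.
dfg = df.groupby(categories, observed=False).count()
bins_labels = ["1-2", "2-5", "5-100", "100-200", "200-1000", ">1000"]

# Equal-width bars at positions 0..5; heights are fractions of all items.
plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=bins_labels)
#plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=categories.categories)

plt.xlabel('Items')
plt.ylabel('Relative frequency')
plt.show()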

[Image: bar chart with categorical data]
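If you would rather not depend on pandas for the binning, the same idea can be sketched with the standard library. Treat this as an illustration under assumptions, not a drop-in solution: the helper names bin_fractions and plot_fractions and the small sample list are made up for this example, and the bins are half-open [a, b) with the last bin closed on the right, which differs from the right-closed intervals that pandas.cut uses above.

```python
from bisect import bisect_right


def bin_fractions(values, edges):
    """Return the fraction of all values that falls into each bin.

    Bins are half-open [edges[i], edges[i+1]); the last bin also includes
    its right edge. Values outside the edges are not counted in any bin
    but still count towards the total. Assumes values is non-empty.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        # bisect_right finds the bin; min() folds the top edge into the last bin.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    total = len(values)
    return [c / total for c in counts]


def plot_fractions(fractions, labels):
    """Draw one equal-width bar per bin, labelled with the bin range."""
    # Imported here so that the counting code above has no dependencies.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(range(len(fractions)), fractions, width=0.8, tick_label=labels)
    ax.set_xlabel('Items')
    ax.set_ylabel('Fraction of items')
    return fig, ax


# Hypothetical sample data standing in for the contents of class_id.txt.
sample = [1, 1, 1, 3, 4, 50, 150, 300, 2000, 5000]
edges = [1, 2, 5, 100, 200, 1000, max(sample)]
labels = ["1", "2-5", "5-100", "100-200", "200-1000", ">1000"]
fractions = bin_fractions(sample, edges)
```

Calling plot_fractions(fractions, labels) followed by plt.show() should then give bars of equal width whose heights add up to 1, as long as every value lies inside the bin edges.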

Not what you asked for, but you could also keep the histogram and use a logarithmic x-axis to improve readability:

plt.xscale('log')
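On the normalization itself: density=True scales the bars so that the total area is 1, which means each count is also divided by its bin width, and with bins as wide as 200-1000 the bars can become too small to see. One way to get count / total instead is to pass per-item weights to hist. The following is a rough sketch, assuming the helper names relative_weights and plot_normalized_hist (both invented here) and assuming that all bin edges are positive so the log scale is valid.

```python
def relative_weights(values):
    """One weight per value, chosen so that the bar heights sum to 1."""
    n = len(values)
    return [1.0 / n] * n


def plot_normalized_hist(values, edges):
    """Histogram with count/total on the y axis and a log-scaled x axis."""
    # Imported here so that relative_weights can be used without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    heights, _, _ = ax.hist(values, bins=edges,
                            weights=relative_weights(values))
    ax.set_xscale('log')
    # Put the ticks on the bin edges and label them with plain numbers.
    ax.set_xticks(edges)
    ax.set_xticklabels([str(e) for e in edges])
    ax.set_xlabel('Items')
    ax.set_ylabel('Fraction of items')
    return fig, heights
```

With the data from the question this would be called as plot_normalized_hist(class_id, num_bins), followed by plt.show(). The bars still have different widths in data units, but on the log axis they are much closer to each other than on a linear one.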

Upvotes: 0
