Reputation:
I am trying to generate a histogram using matplotlib. I am reading data from the following file: https://github.com/meghnasubramani/Files/blob/master/class_id.txt
My intent is to generate a histogram with the following bins: 1, 2-5, 5-100, 100-200, 200-1000, >1000.
When I generate the graph it doesn't look nice. I would like to normalize the y axis to (frequency of occurrence in a bin/total items). I tried using the density parameter but whenever I try that my graph ends up completely blank. How do I go about doing this.
How do I get the width's of the bars to be the same, even though the bin ranges are varied?
Is it also possible to specify the ticks on the histogram? I want to have the ticks correspond to the bin ranges.
import matplotlib.pyplot as plt
FILE_NAME = 'class_id.txt'
class_id = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [1, 2, 5, 100, 200, 1000, max(class_id)]
x = plt.hist(class_id, bins=num_bins, histtype='bar', align='mid', rwidth=0.5, color='b')
print (x)
plt.legend()
plt.xlabel('Items')
plt.ylabel('Frequency')
Upvotes: 0
Views: 2700
Reputation: 786
As suggested by importanceofbeingernest, we can use bar charts to plot categorical data and we need to categorize values in bins, for ex with pandas:
import matplotlib.pyplot as plt
import pandas
FILE_NAME = 'class_id.txt'
class_id_file = [int(line.rstrip('\n')) for line in open(FILE_NAME)]
num_bins = [0, 2, 5, 100, 200, 1000, max(class_id_file)]
categories = pandas.cut(class_id_file, num_bins)
df = pandas.DataFrame(class_id_file)
dfg = df.groupby(categories).count()
bins_labels = ["1-2", "2-5", "5-100", "100-200", "200-1000", ">1000"]
plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=bins_labels)
#plt.bar(range(len(categories.categories)), dfg[0]/len(class_id_file), tick_label=categories.categories)
plt.xlabel('Items')
plt.ylabel('Frequency')
Not what you asked for, but you could also stay with histogram and choose logarithm scale to improve readability:
plt.xscale('log')
Upvotes: 0