Reputation: 1249
I am trying to plot a simple histogram. I have processed my data as a list: X = [30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138...]
where every position corresponds to the number of items with that value (so 30 items for 0, 2728 for 1, etc.). If I plot this list as a bar chart I get the desired result, but the resolution is too high (i.e. every value is its own bucket). What I want to do is merge buckets so that my X values are 0, 1-10, 10-50, 50-150, 150-500 and my Y values are the sum of items in each range. So for 0 I will have the y value 30, for 1-10 I will have the value sum(2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225), etc.
I tried this way:
plt.hist(X,bins=[0,1,10])
but I don't get the desired result. I expect one bar for 0-1 with y=30 and a second bar for 1-10 with y=24785, but that's not what it plots.
What's the best way to do this?
Upvotes: 0
Views: 1953
Reputation: 203
You want to merge the buckets into a customized list:
0, 1-10, 10-50, 50-150, 150-500.
plt.hist does accept a custom list of bin edges, but it treats X as raw samples rather than as counts, so I would suggest manually counting how many values are in each of the customized bins. It greatly helps if you transform your list X into a NumPy array using np.array().
import numpy as np
import matplotlib.pyplot as plt

X = np.array([30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138])
## Customized bin list:
bin_list = np.array([0, 1, 10, 50, 150, 500, np.inf])  ## Can specify 500 to be inf as well
plot_bin = np.zeros(len(bin_list) - 1)
for bin_n in range(len(bin_list) - 1):
    ## Count the entries of X that fall in [lower edge, upper edge)
    plot_bin[bin_n] = np.sum((X >= bin_list[bin_n]) & (X < bin_list[bin_n + 1]))
## Create string version of the buckets to use as labels
str_bin_list_lower = [str(a) for a in bin_list[0:-1]]
## Bars are centered on their x positions by default in current matplotlib
x_ticks = np.arange(len(bin_list) - 1)
plt.bar(x_ticks, plot_bin)
plt.xticks(x_ticks, str_bin_list_lower)
Edited: I misunderstood your question. You have bucket boundaries [0, 10, 50, ...] and want to add up the entries of X at index 0, at indices 1-10, at indices 11-50, etc. Keep in mind that Python slices exclude the end index. For example, list(range(10))[0:5] is [0, 1, 2, 3, 4] and list(range(10))[5:10] is [5, 6, 7, 8, 9]. You need to account for this when you make your bin list.
Then the binning process should be:
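import numpy as np

X = np.array([30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138])
## Integer slice edges: buckets cover indices 1-10, 11-50, 51-150, 151-500, 501 onward
bin_list = np.array([0, 10, 50, 150, 500]) + 1  ## [1, 11, 51, 151, 501]
plot_bin = np.zeros(len(bin_list))
for bin_n in range(len(bin_list)):
    if bin_n == len(bin_list) - 1:
        ## Last bucket: everything from index 501 to the end of X
        plot_bin[bin_n] = np.sum(X[bin_list[bin_n]:])
    else:
        ## The slice end is exclusive, so no extra +1 is needed here
        plot_bin[bin_n] = np.sum(X[bin_list[bin_n]:bin_list[bin_n + 1]])
## Put the count for value 0 in front as its own bucket
plot_bin = np.insert(plot_bin, 0, X[0])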
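If you would rather avoid the explicit loop, the same slice sums can be sketched with np.add.reduceat. This is only a minimal sketch: the names starts and bucket_sums are mine, and it assumes the buckets are index 0, 1-10, 11-50, 51-150 and 151-500, with anything past index 500 ignored.

```python
import numpy as np

X = np.array([30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138])

# Assumed first index of each bucket: 0, 1-10, 11-50, 51-150, 151-500
starts = np.array([0, 1, 11, 51, 151])
# Ignore counts for values above 500 (assumption about the last bucket)
capped = X[:501]
# reduceat needs valid indices, so drop buckets that begin past the end of the data
starts = starts[starts < len(capped)]
# Sums capped[starts[i]:starts[i+1]] per bucket; the last bucket runs to the end
bucket_sums = np.add.reduceat(capped, starts)
print(bucket_sums)
```

With the 13 sample values from the question this gives three buckets (0, 1-10 and 11-50), and the second one is the 24785 you expected.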
Upvotes: 0
Reputation: 25249
Given the way you preprocessed your data, the right way to plot it is:
X = [30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138]
plt.bar(range(len(X)),X);
However, matplotlib provides an even easier and more straightforward way to plot a histogram from raw (uncounted) samples:
x = np.random.randn(1000)
plt.hist(x, bins=30);
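The histogram machinery can also be used on data that is already counted, by passing the counts as weights. Below is a rough sketch with np.histogram; plt.hist takes the same bins and weights arguments. The names values and edges are mine, and the edges assume half-open buckets [0, 1), [1, 11), [11, 51), [51, 151), [151, 501].

```python
import numpy as np

X = [30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138]

# The value each count belongs to: position 0 holds the count for 0, and so on
values = np.arange(len(X))
# Assumed bucket edges; note that NumPy closes the last bin on the right,
# so a value of exactly 501 would also land in the final bucket
edges = [0, 1, 11, 51, 151, 501]
# Each value contributes its count (the weight) instead of 1
counts, _ = np.histogram(values, bins=edges, weights=X)
print(counts)
```

The first two entries of counts are 30 and 24785, which are the two bars the question asks for. Buckets beyond the sample data simply come out as 0.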
If you want more direct control over binning, you may want to switch to Pandas and try pd.cut, where you can define your own bins:
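import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.random.randint(0, 100, 1000)})
# Bins are (1, 10], (10, 20], (20, 100]; values outside them (here 0 and 1) are dropped
factor = pd.cut(df.x, [1, 10, 20, 100])
# Count the rows in each bin and draw them as bars (plotting needs matplotlib installed)
df.groupby(factor, observed=False)['x'].count().plot(kind='bar', rot=45);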
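For the pre-counted list from the question, the same pd.cut idea could look roughly like this. It is a sketch under my own assumptions: the column names value, count and bucket, the labels, and the half-open edges are not from the question.

```python
import numpy as np
import pandas as pd

X = [30, 2728, 2894, 2582, 2309, 2396, 2491, 2453, 2382, 2325, 2225, 2359, 2138]

# One row per value, with the number of items that have that value
df = pd.DataFrame({"value": np.arange(len(X)), "count": X})

# Assumed buckets; right=False makes them [0, 1), [1, 11), [11, 51), ...
edges = [0, 1, 11, 51, 151, 501]
labels = ["0", "1-10", "11-50", "51-150", "151-500"]
df["bucket"] = pd.cut(df["value"], bins=edges, right=False, labels=labels)

# Sum the counts inside each bucket; observed=False keeps the empty buckets as 0
summed = df.groupby("bucket", observed=False)["count"].sum()
print(summed)
```

Calling summed.plot(kind='bar', rot=45) then draws one bar per bucket, in the same style as the example above.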
Upvotes: 1