Reputation: 313
I have a CSV file from which I extract different columns of data and pass them to a function that draws a histogram. The problem is that the histogram is correct for the data in one column but wrong for the data in another. For example, the histogram works perfectly for a column that contains ages, but for a column that contains populations (much larger numbers than the ages) it only shows me the number of rows in the column. When I print the NumPy array, it looks the same as the age data. The only differences I can see between the two columns are that the age column has over 10k rows compared to 558 population rows, and that the population numbers have 5-6 digits compared to 1-2 digits for age.
Age column (these work in histogram):
82
50
53
67
26
56
50
26
60
26
59
54
25
53
52
67
22
55
57
84
55
74
67
70
59
62
32
Population column (the histogram only shows the number of values in this column):
43191
73901
38247
98266
66781
62075
30444
96109
40497
37964
40822
40599
28360
24949
34969
49455
18128
34586
37489
48177
22061
35218
53745
97493
39764
16193
65818
53285
My function is:
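import matplotlib.pyplot as plt

def histogram(column_data):
    plt.title(col_name)          # col_name is defined elsewhere in my script
    df = column_data.to_numpy()  # pandas column -> NumPy array
    af = df.reshape(-1)          # flatten to 1-D
    plt.hist(af)
    plt.show()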
Upvotes: 1
Views: 49
Reputation: 7941
Interesting behavior. I also saw it on the first run (i.e. no plot). Once I passed bins to the hist call, the problem went away. It likely has to do with how sparse your population data is relative to its quoted precision (a few hundred values with 5-6 digits each), just as you surmised.
import numpy as np
import matplotlib.pyplot as p

# Population values quoted in the question
pop = [43191, 73901, 38247, 98266,
       66781, 62075, 30444, 96109,
       40497, 37964, 40822, 40599,
       28360, 24949, 34969, 49455,
       18128, 34586, 37489, 48177,
       22061, 35218, 53745, 97493,
       39764, 16193, 65818, 53285]
dat = np.random.rand(1000)  # less sparse data

p.figure(figsize=(10, 3))
p.subplot(131)
p.hist(pop)             # default binning (10 bins)
p.subplot(132)
p.hist(pop, bins=100)   # explicit, finer binning
p.subplot(133)
p.hist(dat, bins=100)
p.show()
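If you want to check what is actually being binned without relying on the plot, one option is to compute the counts directly with np.histogram, which also defaults to 10 equal-width bins. This is only a rough sketch using the population values quoted in the question; the variable names (counts10, edges10 and so on) are made up for illustration.

```python
import numpy as np

# Population values quoted in the question
pop = np.array([43191, 73901, 38247, 98266,
                66781, 62075, 30444, 96109,
                40497, 37964, 40822, 40599,
                28360, 24949, 34969, 49455,
                18128, 34586, 37489, 48177,
                22061, 35218, 53745, 97493,
                39764, 16193, 65818, 53285])

# Default binning: 10 equal-width bins between min and max
counts10, edges10 = np.histogram(pop)

# Finer binning, matching bins=100 in the plot above
counts100, edges100 = np.histogram(pop, bins=100)

print("dtype:", pop.dtype)            # should be numeric, not object/str
print("10 bins :", counts10)
print("range   :", edges10[0], "to", edges10[-1])
print("100 bins: tallest bar holds", counts100.max(), "value(s)")
```

If the counts look sensible here but the plot does not, the issue is more likely in how the column reaches plt.hist (shape or dtype) than in the binning itself.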
Upvotes: 1