Histogram issues between different column datasets

Question

I have a CSV from which I extract individual columns and pass them to a function that plots a histogram. The problem is that the histogram is correct for one column but wrong for another. For example, the histogram works perfectly for a column of ages, but for a column of populations (much larger numbers than the ages) it only shows the number of rows in the column. When I print the NumPy array, it looks the same as it does for the age data. The only differences I can see between the two columns are that the age column has over 10k rows compared to 558 population rows, and that the population numbers have 5-6 digits compared to 1-2 digits for age.

Age column (these work in histogram):

Populations Column (histogram represents the number of values in the Population column).

My function is:

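import matplotlib.pyplot as plt

def histogram(column_data):
    # col_name is assumed to be defined elsewhere in the script
    plt.title(col_name)
    df = column_data.to_numpy()  # convert the column to a NumPy array
    af = df.reshape(-1)          # flatten to 1-D
    plt.hist(af)
    plt.show()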

roadrunner66 · Accepted Answer

Interesting behavior. I also saw it on the first run (i.e. no plot). Once I passed bins to the plot command, the problem went away. It likely comes from how sparse your population data is compared with the precision of the values (few rows, 5-6 digit numbers), just as you surmised.

import numpy as np
import matplotlib.pyplot as p

# sample population values (sparse: few points, 5-digit numbers)
pop = [43191, 73901, 38247, 98266,
       66781, 62075, 30444, 96109,
       40497, 37964, 40822, 40599,
       28360, 24949, 34969, 49455,
       18128, 34586, 37489, 48177,
       22061, 35218, 53745, 97493,
       39764, 16193, 65818, 53285]

dat = np.random.rand(1000)  # less sparse data

p.figure(figsize=(10, 3))

p.subplot(131)
p.hist(pop)             # default number of bins
p.subplot(132)
p.hist(pop, bins=100)   # explicit bins
p.subplot(133)
p.hist(dat, bins=100)   # dense data for comparison
p.show()
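
If you want to check what the histogram is actually counting without relying on the plot, one option is to look at the bin counts directly. The sketch below assumes the same sample values as above and uses np.histogram, which computes counts and bin edges without drawing anything. The variable names are only illustrative.

```python
import numpy as np

# Same sample values as in the snippet above.
pop = [43191, 73901, 38247, 98266,
       66781, 62075, 30444, 96109,
       40497, 37964, 40822, 40599,
       28360, 24949, 34969, 49455,
       18128, 34586, 37489, 48177,
       22061, 35218, 53745, 97493,
       39764, 16193, 65818, 53285]

# np.histogram uses 10 equal-width bins by default.
counts_default, edges_default = np.histogram(pop)

# Explicit bin count, matching the bins=100 call above.
counts_fine, edges_fine = np.histogram(pop, bins=100)

print(counts_default)     # counts per bin with the default binning
print(edges_default)      # bin edges span min(pop) to max(pop)
print(counts_fine.sum())  # every value lands in exactly one bin
```

If the counts add up to the number of rows and the edges span the range of your values, the data is being binned as numbers, and any remaining oddity is in how the plot is drawn.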
