zbeedatm

Reputation: 679

Best range of dominant values of histogram curve

I have a histogram like this:

[Image: histogram]

and I have this code that finds the maximum (-21.5 in my case):

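import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

def find_range(column):
    # Fit a Gaussian kernel density estimate to the data
    kde = gaussian_kde(column)
    no_samples = len(column)

    # Evaluate the density on an evenly spaced grid spanning the data
    samples = np.linspace(column.min(), column.max(), no_samples)
    probs = kde.evaluate(samples)
    # The grid point with the highest density is the maximum
    maxima_index = probs.argmax()
    maxima = samples[maxima_index]

    plt.scatter(samples, probs) #, color='b',linewidths=0.05)
    plt.show()

    return [maxima]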

But I need to find the range of the most dominant values of the histogram (in this histogram, for example, -30 to -5): something like the values on either side of the maximum where the probability equals 20% of the maximum probability.

How can I achieve this? I tried the following:

t_right = list(filter(lambda tup:np.logical_and(tup[1] > maxima , probs[tup[0]] <= max(probs)*0.2), enumerate(samples)))

but it returns many values. I want only the single value on each side where the threshold cuts the curve.

Upvotes: 3

Views: 254

Answers (2)

zbeedatm

Reputation: 679

This is my solution; I'd be glad to hear other ideas:

import math

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

def find_range(column):
    kde = gaussian_kde(column)
    no_samples = len(column)

    samples = np.linspace(column.min(), column.max(), no_samples)
    probs = kde.evaluate(samples)
    maxima_index = probs.argmax()
    maxima = samples[maxima_index]

    # Keep grid points right of the maximum whose density is within abs_tol of 20% of the peak
    t_right_list = list(filter(lambda tup:np.logical_and(tup[1] > maxima , math.isclose(probs[tup[0]],  max(probs)*0.2, abs_tol=0.00001) ), enumerate(samples)))
    # Reduce the matches to one value; raises IndexError if no grid point matched
    t_right = np.median(list(zip(*t_right_list))[1])
    # Same on the left side of the maximum
    t_left_list = list(filter(lambda tup:np.logical_and(tup[1] < maxima , math.isclose(probs[tup[0]],  max(probs)*0.2, abs_tol=0.00001) ), enumerate(samples)))
    t_left = np.median(list(zip(*t_left_list))[1])

    plt.scatter(samples, probs) #, color='b',linewidths=0.05)
    plt.show()

    return [t_left, maxima, t_right]

If more than one value is retrieved for t_right or t_left (because of the abs_tol tolerance), the median is used to reduce them to a single value.
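The tolerance-based match above depends on a grid point happening to land close to the 20% level. A possible alternative, sketched here under assumptions, is to walk outward from the peak to the first grid point that drops below the threshold and interpolate linearly between it and its neighbour. The names `dominant_range`, `fraction`, and `grid_size` are made up for this illustration, and the grid still only spans the data, so a side that never drops below the threshold falls back to the data edge.

```python
import numpy as np
from scipy.stats import gaussian_kde


def dominant_range(column, fraction=0.2, grid_size=512):
    """Return (left, peak, right): where the KDE falls to `fraction` of its peak.

    Hypothetical helper; not part of the original answer.
    """
    values = np.asarray(column, dtype=float)
    kde = gaussian_kde(values)
    xs = np.linspace(values.min(), values.max(), grid_size)
    probs = kde.evaluate(xs)

    peak = int(probs.argmax())
    threshold = fraction * probs[peak]
    below = probs < threshold

    def crossing(inside, outside):
        # Linear interpolation between the last grid point at or above the
        # threshold (inside) and the first one below it (outside).
        x0, x1 = xs[inside], xs[outside]
        p0, p1 = probs[inside], probs[outside]
        return float(x0 + (threshold - p0) * (x1 - x0) / (p1 - p0))

    # Offsets, counted from the peak, of grid points below the threshold.
    right_hits = np.flatnonzero(below[peak:])
    left_hits = np.flatnonzero(below[:peak + 1][::-1])

    if right_hits.size:
        j = peak + right_hits[0]
        right = crossing(j - 1, j)
    else:
        right = float(xs[-1])  # density never drops below the threshold

    if left_hits.size:
        j = peak - left_hits[0]
        left = crossing(j + 1, j)
    else:
        left = float(xs[0])

    return left, float(xs[peak]), right
```

Called as `left, peak, right = dominant_range(column)`, this returns exactly one value per side, and because it stops at the first crossing it ignores any secondary bumps further out in the tails.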

Upvotes: 1

Mikolaj

Reputation: 356

I'm not sure if this is what you are looking for, but I found this article on Towards Data Science. The code from that article is as follows. Link: https://towardsdatascience.com/take-your-histograms-to-the-next-level-using-matplotlib-5f093ad7b9d3


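import matplotlib.pyplot as plt

# `avocado` is assumed to be a pandas Series of prices, as in the article
fig, ax = plt.subplots()

# Plot
# Plot histogram
avocado.plot(kind = "hist", density = True, alpha = 0.65, bins = 15, ax = ax) # change density to true, because KDE uses density
# Plot KDE
avocado.plot(kind = "kde", ax = ax)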

    # Quantile lines
quant_5, quant_25, quant_50, quant_75, quant_95 = avocado.quantile(0.05), avocado.quantile(0.25), avocado.quantile(0.5), avocado.quantile(0.75), avocado.quantile(0.95)
quants = [[quant_5, 0.6, 0.16], [quant_25, 0.8, 0.26], [quant_50, 1, 0.36],  [quant_75, 0.8, 0.46], [quant_95, 0.6, 0.56]]
for i in quants:
    ax.axvline(i[0], alpha = i[1], ymax = i[2], linestyle = ":")


# X
ax.set_xlabel("Average Price ($)")
    # Limit x range to 0-4
x_start, x_end = 0, 4
ax.set_xlim(x_start, x_end)

# Y
ax.set_ylim(0, 1)
ax.set_yticklabels([])
ax.set_ylabel("")

# Annotations
ax.text(quant_5-.1, 0.17, "5th", size = 10, alpha = 0.8)
ax.text(quant_25-.13, 0.27, "25th", size = 11, alpha = 0.85)
ax.text(quant_50-.13, 0.37, "50th", size = 12, alpha = 1)
ax.text(quant_75-.13, 0.47, "75th", size = 11, alpha = 0.85)
ax.text(quant_95-.25, 0.57, "95th Percentile", size = 10, alpha =.8)

# Overall
ax.grid(False)
ax.set_title("Avocado Prices in U.S. Markets", size = 17, pad = 10)

    # Remove ticks and spines
ax.tick_params(left = False, bottom = False)
for spine in ax.spines.values():  # avoid rebinding `ax` inside the loop
    spine.set_visible(False)
    
plt.show()

The output of the code above looks something like this:

[Image: plot with emphasized information]
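If only the numbers are needed rather than the plot, the quantile idea can be reduced to a few lines. This is a minimal sketch, and it is a different technique from the 20%-of-peak threshold in the question: it returns the central interval containing a chosen share of the data, whatever the shape of the density. The name `central_range` and the `coverage` parameter are assumptions made for this example.

```python
import numpy as np


def central_range(column, coverage=0.9):
    """Return (low, high) bounding the central `coverage` share of the data.

    Hypothetical helper; coverage=0.9 gives the 5th to 95th percentile range.
    """
    values = np.asarray(column, dtype=float)
    tail = (1.0 - coverage) / 2.0
    low, high = np.quantile(values, [tail, 1.0 - tail])
    return float(low), float(high)
```

For example, `central_range(column, coverage=0.5)` gives the interquartile range, which corresponds to the 25th and 75th percentile lines in the plot.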

I hope this is helpful! :)

Upvotes: 3
