Oguz Cebeci

Reputation: 107

Cutting outliers in Histogram (Python)

I would like to know whether there is a method that tells me how long my x-axis should be. My data set contains several outliers. I can simply cut them off with plt.xlim(), but is there a statistical method to compute a sensible x-axis limit? In the attached picture, a logical cut would be after a driven distance of 150 km. Ideally, I would like to compute this cut-off threshold.

(Image: logical manual cut after 150 km)

The DataFrame passed to the function is a standard pandas DataFrame.

Code:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev  # assumed: gev is scipy's generalized extreme value distribution
from sklearn.mixture import GaussianMixture

def yearly_distribution(dataframe):

    df_distr = dataframe

    h = sorted(df_distr['Distance'])

    fig, ax = plt.subplots(figsize=(16, 9))

    # Bin edges from 0 to 500.5 km in 0.5 km steps
    bin_edges = np.arange(0, 501, 0.5)

    # density=True normalizes the histogram so that it integrates to 1
    n, bins, patches = plt.hist(h, bins=bin_edges, density=True, facecolor='#023d6b', alpha=0.5, histtype='bar')

    # Grid on which the fitted densities are evaluated
    lnspc = np.arange(0, 500.5, 0.5)

    gevfit = gev.fit(h)
    pdf_gev = gev.pdf(lnspc, *gevfit)
    plt.plot(lnspc, pdf_gev, label="GEV")

    logfit = stats.lognorm.fit(h)
    pdf_lognorm = stats.lognorm.pdf(lnspc, *logfit)
    plt.plot(lnspc, pdf_lognorm, label="LogNormal")

    weibfit = stats.weibull_min.fit(h)
    pdf_weib = stats.weibull_min.pdf(lnspc, *weibfit)
    plt.plot(lnspc, pdf_weib, label="Weibull")

    burrfit = stats.burr.fit(h)
    pdf_burr = stats.burr.pdf(lnspc, *burrfit)
    plt.plot(lnspc, pdf_burr, label="Burr Distribution")

    genparetofit = stats.genpareto.fit(h)
    pdf_genpareto = stats.genpareto.pdf(lnspc, *genparetofit)
    plt.plot(lnspc, pdf_genpareto, label="Generalized Pareto")

    # Gaussian mixture with 8 components; scikit-learn expects 2-D input
    myarray = np.array(h).reshape(-1, 1)
    clf = GaussianMixture(n_components=8, max_iter=500, random_state=3).fit(myarray)
    # score_samples returns the log-density at each grid point
    pdf_gmm = np.exp(clf.score_samples(lnspc.reshape(-1, 1)))
    plt.plot(lnspc, pdf_gmm, label="GMM")

    plt.xlim(0, 500)
    plt.xlabel('Distance')
    plt.ylabel('Probability')
    plt.title('Histogram')
    plt.ylim(0, 0.05)

Upvotes: 3

Views: 10190

Answers (1)

Dadep

Reputation: 2788

You should remove the outliers from your data before any plotting or fitting:

h=sorted(df_distr['Distance'])

out_threshold= 150.0
h=[i for i in h if i<out_threshold]
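The same filter can also be written as a boolean mask on the pandas column, which avoids the Python-level loop. This is only a minimal sketch: it assumes a DataFrame with a 'Distance' column as in the question, and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the real DataFrame comes from the question.
df_distr = pd.DataFrame(
    {"Distance": [12.5, 3.2, 310.4, 149.9, 48.0, 150.0, 495.0]}
)

out_threshold = 150.0  # manual cut-off in km, as in the answer

# Keep only values strictly below the threshold (same condition as i < out_threshold),
# then sort them so the result matches sorted(...) followed by the list filter.
mask = df_distr["Distance"] < out_threshold
h = df_distr.loc[mask, "Distance"].sort_values().to_numpy()

print(h)
```

Note that the comparison is strict, so a value of exactly 150.0 is dropped, just as in the list comprehension.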

EDIT: This may not be the fastest way, but you can compute the threshold with numpy.std():

out_threshold= 2.0*np.std(h+[-a for a in h])
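To unpack that one-liner: mirroring the data around zero forces the mean to 0, so the standard deviation of the mirrored sample equals the root mean square of the distances, and the threshold is twice that value. The sketch below wraps this in a function and, as a separate alternative that is not part of the answer above, shows a percentile-based limit. The function names, the factor k, the percentile q, and the sample values are all illustrative assumptions.

```python
import numpy as np

def mirrored_std_threshold(values, k=2.0):
    """Return k times the std of the data mirrored around zero.

    Because the mirrored sample has mean 0, this equals
    k * sqrt(mean(values**2)), i.e. k times the RMS of the values.
    """
    arr = np.asarray(values, dtype=float)
    mirrored = np.concatenate([arr, -arr])
    return k * np.std(mirrored)

def percentile_threshold(values, q=99.0):
    """Alternative (not from the answer): cut at the q-th percentile."""
    return np.percentile(np.asarray(values, dtype=float), q)

# Hypothetical distances in km with one obvious outlier
h = [3.0, 4.0, 5.0, 6.0, 400.0]

out_threshold = mirrored_std_threshold(h)
kept = [x for x in h if x < out_threshold]

print(out_threshold, kept)
print(percentile_threshold(h))
```

Either value could then be passed to plt.xlim(0, out_threshold) instead of a hand-picked limit. Which rule is appropriate depends on the data; for heavily skewed distances, the factor k or the percentile q will likely need tuning.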

Upvotes: 2
