Emma P

Reputation: 23

Why are the whiskers not displayed correctly with boxplots?

I would like to draw a boxplot for dataframe columns that contain percentages, with the lower whisker fixed at 0 and the upper whisker fixed at 100, so that outliers are easy to spot visually. However, I haven't managed to get the whiskers drawn where I want them. Here I create a column of random percentages plus a few outliers.

import random
from random import randint

import matplotlib.pyplot as plt
import pandas as pd

random.seed(42)
lst=[]
for x in range(140):
    x=randint(1,100)
    lst.append(x)
lst.append(-1)
lst.append(300)
lst.append(140)
print(lst)

df = pd.DataFrame({0:lst})

Here is my function:

def boxplot(df,var,lower_limit=None,upper_limit=None):
    
    q1=df[var].quantile(0.25)
    q3=df[var].quantile(0.75)
    iqr=q3-q1
    w1=w2=1.5
    
    if q1 != q3 and lower_limit is not None:
        w1 = (q1 - lower_limit) / iqr
    
    if q1 != q3 and upper_limit is not None:
        w2 = (upper_limit - q3) / iqr
    
    plt.figure(figsize=(5,5))
    df.boxplot(column=var,whis=(w1,w2))
    plt.show()
    
    print(f'The minimum of {var} is',df[var].min(),'and its maximum is ',df[var].max(),"\n")
    print(f'The first quantile of {var} is ',q1,'its median is ',df[var].median(),'and its third quantile is ',q3,"\n")

I called boxplot(df, 0, lower_limit=0, upper_limit=100) and got this result: [image: Result of the function]

But the whiskers don't go to 100 and I would like to know why.

Upvotes: 2

Views: 829

Answers (1)

Tom

Reputation: 8790

TLDR: I don't think you can do what you want to do. The whiskers must snap to values within your dataset, and cannot be set arbitrarily.

Here is a good reference post: https://stackoverflow.com/a/65390045/13386979.


First of all, kudos on a nice first post. It is great that you provided code to reproduce your problem 👏 There were a few small syntax errors, see my edit.

My impression is that what you want to do is not possible with the matplotlib boxplot (which is what df.boxplot calls). One issue is that when you pass a pair of floats to the whis parameter, they are interpreted as percentiles. Taken from the documentation:

If a pair of floats, they indicate the percentiles at which to draw the whiskers (e.g., (5, 95)). In particular, setting this to (0, 100) results in whiskers covering the whole range of the data.

When you pass lower_limit=0, upper_limit=100 to your function, you end up with w1 == 0.5490196078431373 and w2 == 0.4117647058823529 (you can add a print statement to verify this). This tells the boxplot to extend whiskers to the 0.5th and 0.4th percentile, which are both very small (the boxplot edges are the 25th to 75th percentile). The latter is smaller than the 75th percentile, so the top whisker is drawn at the upper edge of the box.
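To check this without drawing anything, here is a minimal sketch using matplotlib.cbook.boxplot_stats, which computes the statistics that the boxplot draws. The dataset is a hypothetical, deterministic stand-in for the one in the question (the values 1 to 99 plus the three out-of-range points), so the exact numbers differ slightly from those quoted above.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical stand-in for the question's data: 1..99 plus three out-of-range points.
data = np.array(list(range(1, 100)) + [-1, 140, 300], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
w1 = (q1 - 0) / iqr      # same formula as in the question, lower_limit=0
w2 = (100 - q3) / iqr    # same formula as in the question, upper_limit=100

# A pair passed to whis is read as (low percentile, high percentile),
# not as a pair of IQR multipliers.
lo_bound, hi_bound = np.percentile(data, [w1, w2])

stats = cbook.boxplot_stats(data, whis=(w1, w2))[0]
print("w1, w2:", w1, w2)
print("percentile bounds:", lo_bound, hi_bound)
print("whiskers:", stats["whislo"], stats["whishi"], "| q3:", stats["q3"])
```

Both percentile bounds land at the very bottom of the data, so the upper whisker collapses onto the top edge of the box.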

It seems that you based your calculation of w1 and w2 on this section of the documentation:

If a float, the lower whisker is at the lowest datum above Q1 - whis*(Q3-Q1), and the upper whisker at the highest datum below Q3 + whis*(Q3-Q1), where Q1 and Q3 are the first and third quartiles. The default value of whis = 1.5 corresponds to Tukey's original definition of boxplots.

I say this because if you also print q1 - w1 * iqr and q3 + w2 * iqr within your call, you get 0 and 100 (respectively). But this calculation is only relevant when a single float is passed (not a pair).
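As a rough illustration of the single-float case, the sketch below passes w1 and w2 one at a time to matplotlib.cbook.boxplot_stats. The data is again a hypothetical stand-in (1 to 99 plus three out-of-range points), not the random sample from the question.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical stand-in for the question's data.
data = np.array(list(range(1, 100)) + [-1, 140, 300], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
w1 = (q1 - 0) / iqr
w2 = (100 - q3) / iqr

# With these multipliers the fences sit at (approximately) 0 and 100.
lower_fence = q1 - w1 * iqr
upper_fence = q3 + w2 * iqr
print("fences:", lower_fence, upper_fence)

# A single float applies the same multiplier to both sides, so each side
# needs its own call here. The whisker is the last datum inside the fence.
lower_whisker = cbook.boxplot_stats(data, whis=w1)[0]["whislo"]
upper_whisker = cbook.boxplot_stats(data, whis=w2)[0]["whishi"]
print("whiskers:", lower_whisker, upper_whisker)
```

Even with the fences at 0 and 100, the whiskers stop at the nearest data points inside them (1 and 99 for this data), because there is no datum exactly at 0 or 100.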

But okay, then what can you pass to whis to get the limits to be any arbitrary value? This is the real problem: I don't think this is possible. Whatever percentiles you pass, matplotlib draws each whisker at the most extreme data point that lies within the percentile bound, not at the bound itself. Thus, the ends of the whiskers always snap to a point in your dataset (or to the edge of the box when no point qualifies). If you have a point near 0 and one near 100, you could find the corresponding percentiles to place the whiskers there. But without a point there, you cannot hack the whis parameter to set the limits arbitrarily.
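A small sketch of the snapping behaviour, again on a hypothetical stand-in dataset and using matplotlib.cbook.boxplot_stats rather than a full plot: for a few percentile pairs, the whisker ends are compared against the set of actual data values.

```python
import numpy as np
from matplotlib import cbook

# Hypothetical stand-in for the question's data.
data = np.array(list(range(1, 100)) + [-1, 140, 300], dtype=float)
values = set(data.tolist())

results = []
for whis in [(1, 99), (5, 95), (10, 90)]:
    bounds = np.percentile(data, whis)           # may fall between data points
    stats = cbook.boxplot_stats(data, whis=whis)[0]
    lo, hi = float(stats["whislo"]), float(stats["whishi"])
    results.append((whis, lo, hi, lo in values, hi in values))
    print(whis, "bounds:", bounds, "whiskers:", lo, hi)
```

For each pair, the percentile bounds themselves can fall between data points, while the whiskers end on values that are present in the data.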

I think to fully implement what you want, you should look into drawing the boxes and whiskers manually. Though the caution shared in the other post I referenced is also relevant here:

But be aware that this is not a box and whiskers plot anymore, so you should clearly describe what you're plotting here, otherwise people will be mislead.
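One possible way to draw it manually is Axes.bxp, which accepts precomputed statistics instead of raw data. The following is only a sketch: bounded_boxplot is a hypothetical helper (not part of matplotlib or pandas), the data is a stand-in for the question's sample, and the Agg backend line is an assumption for running without a display. The plot title spells out that the whiskers are fixed limits, in line with the caution above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; drop this for interactive use
import matplotlib.pyplot as plt
import numpy as np


def bounded_boxplot(values, lower_limit, upper_limit):
    """Hypothetical helper: box from the quartiles, whiskers at fixed limits."""
    values = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    stats = {
        "q1": q1,
        "med": med,
        "q3": q3,
        # Clamp so that a whisker never ends inside the box.
        "whislo": min(lower_limit, q1),
        "whishi": max(upper_limit, q3),
        # Everything outside the fixed limits is drawn as an individual point.
        "fliers": values[(values < lower_limit) | (values > upper_limit)],
    }
    fig, ax = plt.subplots(figsize=(5, 5))
    artists = ax.bxp([stats])
    ax.set_title(f"Whiskers fixed at {lower_limit} and {upper_limit} (not Tukey whiskers)")
    return fig, stats, artists


# Hypothetical stand-in for the question's data.
data = list(range(1, 100)) + [-1, 140, 300]
fig, stats, artists = bounded_boxplot(data, 0, 100)
print(stats["whislo"], stats["whishi"], sorted(stats["fliers"].tolist()))
```

In an interactive session, calling plt.show() after this displays the figure. With the limits 0 and 100, the three out-of-range points are the only ones drawn as outliers.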

Upvotes: 1
