Captain Trojan

Reputation: 2921

Manual plotly boxplot (not from data)

To the best of my knowledge, a plotly boxplot (at least the default version, which should be equivalent to the generally accepted representation of a boxplot) is defined by 5 values:

q(0.0)    # The minimum (= bottom whisker)
q(0.25)   # The first quartile (= bottom edge of the box)
q(0.5)    # The median (= horizontal line inside the box)
q(0.75)   # The third quartile (= top edge of the box)
q(1.0)    # The maximum (= top whisker)

These are the numbers that I believe plotly should compute when drawing a boxplot.

I possess a dataset on a server that contains a very large (and ever-increasing) amount of numbers, which I'd like to visualize on a client using a boxplot (multiple datasets -> multiple boxplots actually, but that is irrelevant in the context of this question). I figured the best way to do this is to precompute those defining numbers, the five-number summary as it is called, on the server using special tricks and then simply pass the summary to the client. The client can then draw the boxplots easily, without me having to clog the bandwidth or make the client do computational work every time a request for visualization is handled. I hoped I could do this using plotly in js.
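For illustration, a minimal sketch of the server-side step, assuming the values fit in memory (the "special tricks" for huge or streaming data are not shown). The function name `five_number_summary` is hypothetical, and the quartiles use the inclusive median-of-halves rule, which is only one of several conventions.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the inclusive quartile rule."""
    s = sorted(values)
    if not s:
        raise ValueError("need at least one value")
    # For odd lengths the median element belongs to both halves.
    half = (len(s) + 1) // 2
    lower, upper = s[:half], s[len(s) - half:]
    return (s[0], median(lower), median(s), median(upper), s[-1])

summary = five_number_summary([2, 4, 5, 10, 11, 11, 11])
```

Only these five numbers per dataset would then need to travel to the client.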

Plotly is great and it's (un?)fortunately strongly integrated into my project, so I'd like to avoid having to replace it with another vis tool.

Nonetheless, as far as I know, the plotly boxplot drawing function accepts a list of data (which is the expected use case) and does not give the user low-level access to the boxplot visualization through a precomputed summary. I assumed I could bypass this easily though, because for

list_of_data = [A, B, C, D, E]

where A, B, C, D, and E are any permutation of the five-number summary, the boxplot vis should be determined precisely by the summary. I found, however, that this is not the case. Plotly simply does not handle the list of data in this manner, for reasons unknown (and unimaginable) to me, so I don't know how to start fixing this issue.

Ultimately, what I'd like to know is how to craft a small artificial dataset (it does not have to be in Python, of course; I just need the algorithm)

def dataset_for_js_plotly(five_number_summary):
    ...

which results in plotly drawing precisely the boxplot corresponding to the summary, or whether there is in fact a way to manually specify what the boxplot should look like in js plotly that I missed.

Upvotes: 1

Views: 1834

Answers (2)

Pablo Guerrero

Reputation: 1054

According to the box trace reference:

The second signature expects users to supply the boxes corresponding Q1, median and Q3 statistics in the q1, median and q3 data arrays respectively.

When using this second signature, you can also directly specify other statistics like the mean, std, lowerfence, upperfence, etc.

For instance,

var data = [
  {
    q1: [3, 1],
    median: [4, 2],
    q3: [5, 3],
    mean: [4.5, 2.5],
    sd: [1, 1],
    lowerfence: [0.5, 0.5],
    upperfence: [9, 8],
    type: 'box'
  }
];

Plotly.newPlot('myDiv', data);
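On the server side, a trace of this shape could be assembled from precomputed summaries and sent as JSON. This is only a sketch: the helper name `precomputed_box_trace` is hypothetical, and mapping the minimum and maximum to `lowerfence` and `upperfence` is an assumption about how the whiskers should look.

```python
import json

def precomputed_box_trace(summaries):
    """Build one 'box' trace from (min, q1, median, q3, max) tuples."""
    # Transpose: one sequence per statistic instead of one tuple per box.
    columns = list(zip(*summaries))
    keys = ("lowerfence", "q1", "median", "q3", "upperfence")
    trace = {key: list(col) for key, col in zip(keys, columns)}
    trace["type"] = "box"
    return trace

# Two boxes, using the same statistics as the JavaScript example above.
trace = precomputed_box_trace([(0.5, 3, 4, 5, 9), (0.5, 1, 2, 3, 8)])
payload = json.dumps([trace])  # string the client can parse and plot
```

The client would then parse `payload` and hand the resulting array to `Plotly.newPlot`.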

Upvotes: 1

Captain Trojan

Reputation: 2921

I found the solution here. Low-level access to the boxplot calculations has been requested before. For an odd number of elements, there are multiple approaches to calculating Q1 (Q3 is mirrored). They are the following, sketched in Python.

Exclusive method:

from statistics import median

def get_Q1_exclusive(data):
    # Odd N: the median is left out of the lower half.
    s = sorted(data)
    return median(s[:len(s) // 2])

Inclusive method:

def get_Q1_inclusive(data):
    # Odd N: the median is included in the lower half.
    s = sorted(data)
    return median(s[:(len(s) + 1) // 2])

Linear method (used by default for some reason):

def get_Q1_linear(data):
    # Midpoint of the two results above.
    l1 = get_Q1_inclusive(data)
    l2 = get_Q1_exclusive(data)
    return (l1 + l2) / 2

Luckily, it is possible to change the default method by adding an attribute called quartilemethod to the trace in the data parameter:

Plotly.newPlot( ..., 
[
    {
        y: [2, 4, 5, 10, 11, 11, 11],
        type: 'box',
        quartilemethod: "inclusive" // or "exclusive" or "linear" (default)
    }
],
...
)

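The overall solution to the problem is therefore to keep the original input array as it is, containing min, Q1, med, Q3, and max, and to use the inclusive method for the Q1/Q3 computation. With five sorted values, the inclusive Q1 is the median of the lowest three, which is exactly the supplied Q1 (and likewise for Q3). It works with my code, the change is minimal, and the issue is fixed.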
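The function asked for in the question could then look roughly like this. It is a sketch under the assumptions above: the trace is a plain dict to be serialised for the client, and `q1_inclusive` and `q1_exclusive` are hypothetical helpers that only mimic the quartile rules to check the idea; they are not Plotly code.

```python
from statistics import median

def dataset_for_js_plotly(five_number_summary):
    """Return a box trace whose raw `y` values are the summary itself."""
    return {
        "y": sorted(five_number_summary),
        "type": "box",
        "quartilemethod": "inclusive",
    }

def q1_inclusive(values):
    # Odd length: the median is counted in the lower half.
    s = sorted(values)
    return median(s[:(len(s) + 1) // 2])

def q1_exclusive(values):
    # Odd length: the median is left out of the lower half.
    s = sorted(values)
    return median(s[:len(s) // 2])

trace = dataset_for_js_plotly([0.5, 3, 4, 5, 9])
```

For this input the inclusive rule gives back the supplied Q1 of 3, while the exclusive rule gives 1.75, which is why the method matters for a five-value list. With raw data, Plotly may still draw extreme values as separate outlier points, so this check only covers the quartiles.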

Upvotes: 0
