Reputation: 823
I plotted a simple data matrix
39 135 249 1 91 8 28 0 0 74 17 65 560
69 0 290 26 254 88 31 0 18 53 4 63 625
66 186 344 0 9 0 0 0 18 54 0 74 554
80 41 393 0 0 0 2 0 6 51 0 65 660
271 112 511 1 0 274 0 0 0 0 16 48 601
88 194 312 0 110 0 0 0 44 13 2 76 624
198 147 367 0 15 0 0 3 9 44 3 39 590
using a standard boxplot (i.e. where the whiskers extend at most 1.5 × IQR beyond Q1 and Q3). Each column is a variable, each row an observation.
Nevertheless, I obtained two different plots using R (RStudio 1.0.44) and Matlab2018. In particular, the whiskers extend differently.
In Matlab I'm using the following code:
% clearing workspace
clear all;
close all;
clc;
% change to the folder of the active script, which also holds the txt data file
tmp = matlab.desktop.editor.getActive;
cd(fileparts(tmp.Filename));
clear tmp;
%reading data
df = readtable('pippo.txt', 'Delimiter', '\t', 'ReadVariableNames',false);
df = table2array(df)
figure(1);
boxplot(df(:, 1:end-1), 'Whisker', 1.5);
ylim([0 600]);
which produces the following graph:
In R I'm using the following code:
rm(list = ls())
# get the directory of the active RStudio document
working_dir <- dirname(rstudioapi::getActiveDocumentContext()$path)
# set the working directory to the folder that holds the txt data file
setwd(working_dir)
df <- read.table("pippo.txt")
jpeg('r_boxplot.jpg')
boxplot(df[,1:12], las=2, ylim=c(0,600), range=1.5)
dev.off()
which produces the following graph:
Observation 1: if I omit the 'Whisker' and 'range' parameters from the two scripts, I obtain the same two graphs as before; this is expected, as 1.5 seems to be the default whisker value in both.
Observation 2: both MATLAB and R seem to read the data correctly, i.e. both workspaces show the same matrix.
What am I missing? Which graph should I trust?
Upvotes: 3
Views: 468
Reputation: 30046
From the MATLAB boxplot documentation:
On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. The whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually using the '+' symbol.
You likely want to check out the outlier computation.
Under the optional 'Whisker' input (default 1.5), you can see this explanation:
boxplot draws points as outliers if they are greater than q3 + w × (q3 – q1) or less than q1 – w × (q3 – q1), where w is the maximum whisker length, and q1 and q3 are the 25th and 75th percentiles of the sample data, respectively.
If you set the 'Whisker' option to 0.7, you get the same plot as seen in your R code:
boxplot(df(:, 1:end-1), 'Whisker', 0.7);
The equivalent input for R's boxplot is range (docs):
Range: this determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.
This appears to be the same definition as shown above from the MATLAB docs - please refer to Hojo's answer for slightly more detail about the IQR computation.
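Since both definitions read the same, it may help to see the rule spelled out. The following is a minimal sketch in Python (not the internals of either program) of "whiskers extend to the most extreme data point that is not an outlier". The function name `whisker_ends` is made up for illustration, and the quartiles are passed in by hand: 301 and 380 are the values R reports for the third column in this thread, 295.5 and 386.5 are the ones MATLAB reports.

```python
def whisker_ends(values, q1, q3, w=1.5):
    """Apply the quoted rule for a positive whisker factor w.

    Returns (lower whisker end, upper whisker end, outliers).
    Hypothetical helper, not part of MATLAB or R.
    """
    iqr = q3 - q1
    lower_fence = q1 - w * iqr
    upper_fence = q3 + w * iqr
    # Points beyond a fence are drawn individually as outliers.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return min(inside), max(inside), outliers


# Third column of the matrix in the question.
col3 = [249, 290, 344, 393, 511, 312, 367]

r_result = whisker_ends(col3, 301, 380)           # quartiles as reported by R
matlab_result = whisker_ends(col3, 295.5, 386.5)  # quartiles as reported by MATLAB

print(r_result)       # (249, 393, [511])
print(matlab_result)  # (249, 511, [])
```

With the same rule and the same factor of 1.5, the upper whisker ends at 393 or at 511 depending only on which quartiles go in, so the quartile computation is the place to look.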
Upvotes: 1
Reputation: 1091
explanation for R boxplot code
Going through both functions, I found that they appear to calculate exactly the same thing, even down to how they define the IQR.
R claims to be doing the following for the boxplot:
upper whisker = min(max(x), Q_3 + 1.5 * IQR)
lower whisker = max(min(x), Q_1 – 1.5 * IQR)
where IQR = Q_3 – Q_1, the box length.
MATLAB claims to be doing this for their boxplot
p75 + w(p75 – p25)
p25 – w(p75 – p25)
where p25 and p75 are the 25th and 75th percentiles, respectively.
Even how they define whisker extension is the same with Matlab stating
% The plotted whisker extends to the adjacent value, which is the most
% extreme data value that is not an outlier. Set whisker to 0 to give
% no whiskers and to make every point outside of p25 and p75 an outlier.
And R states
Range determines how far the plot whiskers extend out from the box. If range is
positive, the whiskers extend to the most extreme data point which is no more than
range times the interquartile range from the box. A value of zero causes the whiskers
to extend to the data extremes.
Personally, I feel that it has to do with the way the underlying computations are performed.
Edit: After experimenting with the code, I can confirm that the difference comes entirely from the underlying computations, specifically from how the 25th and 75th percentiles are calculated.
R code
> quantile(a, c(.25, .75))
25% 75%
301 380
> 380+1.5*(380-301)
[1] 498.5
> 301-1.5*(380-301)
[1] 182.5
Matlab code
prctile(te,[25,75])
ans =
295.5000 386.5000
p25 = ans(1); p75 = ans(2);
W75 = p75 + 1.5*(p75-p25)
W25 = p25 - 1.5*(p75-p25)
W75 =
523
W25 =
159
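I used the 3rd column of your data to test how the quantiles are being calculated. As you can see, the 25th and 75th percentiles are not very different, but just different enough to give wider whisker cutoffs in the MATLAB code. The largest value in that column is 511: it lies above R's upper cutoff of 498.5, so R draws it as an outlier, but below MATLAB's cutoff of 523, so MATLAB extends the whisker up to it.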
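To make the difference concrete, here is a rough Python sketch of the two interpolation rules as I understand them from the respective documentation: R's default `quantile()` (type 7) and the rule described for MATLAB's `prctile`. The function names are my own, and this is an illustration that reproduces the numbers above, not the actual source of either program.

```python
import math


def quantile_r_type7(values, p):
    """R's default quantile rule (type 7): one-based position 1 + (n - 1) * p."""
    xs = sorted(values)
    h = (len(xs) - 1) * p              # zero-based fractional position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def percentile_matlab(values, p):
    """Rule documented for prctile: the i-th sorted value is treated as the
    100 * (i - 0.5) / n percentile, with linear interpolation in between.
    Here p is a fraction (0.25), whereas prctile itself takes percent (25)."""
    xs = sorted(values)
    n = len(xs)
    h = min(max(n * p + 0.5, 1), n)    # one-based position, clamped to the data
    lo = math.floor(h)
    hi = min(lo + 1, n)
    return xs[lo - 1] + (h - lo) * (xs[hi - 1] - xs[lo - 1])


def fences(q1, q3, w=1.5):
    """Outlier cutoffs q1 - w * IQR and q3 + w * IQR."""
    iqr = q3 - q1
    return q1 - w * iqr, q3 + w * iqr


# Third column of the matrix in the question.
col3 = [249, 290, 344, 393, 511, 312, 367]

r_q1, r_q3 = quantile_r_type7(col3, 0.25), quantile_r_type7(col3, 0.75)
m_q1, m_q3 = percentile_matlab(col3, 0.25), percentile_matlab(col3, 0.75)

print(r_q1, r_q3, fences(r_q1, r_q3))  # 301.0 380.0 (182.5, 498.5)
print(m_q1, m_q3, fences(m_q1, m_q3))  # 295.5 386.5 (159.0, 523.0)
```

One caveat: R's `boxplot` takes its box edges from the hinges returned by `fivenum` rather than from `quantile()`. For this seven-value column the hinges are also 301 and 380, so the comparison above is unaffected, but for other sample sizes the two can differ slightly.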
Upvotes: 2