pbreach
pbreach

Reputation: 17017

How to get data from adjusted quantile plot in R?

I have a data.frame with two columns and am using aq.plot from the mvoutlier package, to identify potential outliers in my 2-D dataset. The only problem is I'm not very happy with the "look" of the plots generated and would like to grab the data they are plotting and make the plot in some other software.

For my specific case the plot is generated by,

library('mvoutlier')

data = read.csv(fp, colClasses=c("NULL",NA,NA))

h = aq.plot(data)

the data.frame, data looks like this:

    pr          tas
1   5.133207    59.24362
2   20.173075   75.81661
3   24.819054   97.31020
4   35.893467   92.11203
5   27.752425   95.70120
6   25.765618   91.14163
7   20.895360   57.30519
8   8.921513    70.31467
9   36.031261   98.24573
10  27.166213   92.79554
11  8.889431    54.48514
12  59.564447   85.69632
13  43.818336   99.36451
14  43.408963   84.23207
15  22.653269   84.89939
16  21.480331   96.18303
17  22.827370   69.97202
18  23.252464   85.08739
19  14.618731   45.30504
20  40.795519   78.56758
21  37.310456   80.30799
22  31.099105   91.31675
23  33.107472   63.07043
24  9.611930    35.62702

And the generated plot looks like this:

enter image description here

So my question is, how can I get the information plotted in the subplot on the top right? By information I mean the x, y coordinates and number associated with each point. It would also be great if there were a way to get the x values at which the two vertical lines are drawn.

I see that the output h from calling the aq.plot() command gives a boolean array stating which points are outliers (TRUE) or not (FALSE) but there appear to be no access to the underlying components of the plot.

Any help would be much appreciated.

Upvotes: 1

Views: 478

Answers (1)

Jota
Jota

Reputation: 17621

It's all in the code for aq.plot. Here is the specific code for the plot in the upper right:

plot(s$x, (1:length(dist))/length(dist), col = 3, xlab = "Ordered squared robust distance", 
        ylab = "Cumulative probability", type = "n")
    text(s$x, (1:length(dist))/length(dist), as.character(s$ix), 
        col = 3, cex = 0.8)
    t <- seq(0, max(dist), by = 0.01)
    lines(t, pchisq(t, df = ncol(x)), col = 6)
    abline(v = delta, col = 5)
    text(x = delta, y = 0.4, paste(100 * (pchisq(delta, df = ncol(x))), 
        "% Quantile", sep = ""), col = 5, pos = 2, srt = 90, 
        cex = 0.8)
    xarw <- arw(x, covr$center, covr$cov, alpha = alpha)
    if (xarw$cn < Inf) {
        abline(v = xarw$cn, col = 4)
        text(x = xarw$cn, y = 0.4, "Adjusted Quantile", col = 4, 
            pos = 4, srt = 90, cex = 0.8)
    }

If you look through the code of the function aq.plot, you will see that you can get the x coordinates and the associated observation this way:

covr <- robustbase::covMcd(data, alpha = 1/2)
dist <- mahalanobis(data, center = covr$center, cov = covr$cov)
s <- sort(dist, index = TRUE)
s$x 
#        22          4          6         10         21         18         15          5         14 
# 0.1152036  0.2181437  0.3148553  0.3255492  0.3752751  0.4076276  0.4661830  0.5299942  0.7093746 
#         9         20          3         16          2         13         17         23         12 
# 0.7564636  0.7756129  0.8838616  1.0807574  1.3059546  1.4891242  1.8606975  2.9690980  3.9152682 
#         8          7          1         11         19 
# 4.0283820  5.0767176  7.4233298  7.9488595 10.3217389 

Then the y coordinates:

(1:length(dist))/length(dist)
#[1] 0.04347826 0.08695652 0.13043478 0.17391304 0.21739130 0.26086957 0.30434783 0.34782609
#[9] 0.39130435 0.43478261 0.47826087 0.52173913 0.56521739 0.60869565 0.65217391 0.69565217
#[17] 0.73913043 0.78260870 0.82608696 0.86956522 0.91304348 0.95652174 1.00000000

You can rebuild that plot directly using the following code altered from above. Reading through this code and following along as you build the plot should help you see where to find each piece of information. Look to the abline calls for info on the vertical lines, and you'll find the values here qchisq(0.975, df = ncol(data)) and here arw(data, covr$center, covr$cov, alpha = 0.05)$cn

 plot(s$x, (1:length(dist))/length(dist), col = 3, xlab = "Ordered squared robust distance", 
        ylab = "Cumulative probability", type = "n")
    text(s$x, (1:length(dist))/length(dist), as.character(s$ix), 
        col = 3, cex = 0.8)
    t <- seq(0, max(dist), by = 0.01)
    lines(t, pchisq(t, df = ncol(data)), col = 6)
    abline(v = qchisq(0.975, df = ncol(data)), col = 5)
    text(x = qchisq(0.975, df = ncol(data)), 
         y = 0.4, paste(100 * (pchisq(qchisq(0.975, df = ncol(data)), df = ncol(data))), 
        "% Quantile", sep = ""), col = 5, pos = 2, srt = 90, 
        cex = 0.8)
    xarw <- arw(data, covr$center, covr$cov, alpha = 0.05)
    if (xarw$cn < Inf) {
        abline(v = xarw$cn, col = 4)
        text(x = xarw$cn, y = 0.4, "Adjusted Quantile", col = 4, 
            pos = 4, srt = 90, cex = 0.8)
    }

Upvotes: 3

Related Questions