Sharon Soler

Reputation: 109

Subsetting based on standard deviation of the mean

I have a data frame that consists of two columns of variables (mixing up validation and testing data). I calculated the mean and standard deviation of each column, and now I want to trim the data to remove the points outside of the error bars.

How can I remove the points outside of the 'yellow area', i.e. the points that do not lie within 1 standard deviation of the mean?


To illustrate my problem, here is a small part of the data frame, along with the code used so far.

ppv_dataset <- data.frame(NPVF=c(537428267.18, 593361648.89, 239331813.71, 564188133.09, 309720858.48, 286511353.97, 240790667.83, 484104247.40), 
                       npv=c(406866996.1019452, 679310854.3856647, 3816961.8569191, 685153713.2962445, 677629647.0433271, 450006801.2676973, 192824789.9761059, 492550821.6983585))

x <- apply((ppv_dataset$NPVF/100000000), 2, mean)
x.sd <- apply((ppv_dataset$NPVF/100000000), 2, sd)
y <- apply((ppv_dataset$npv/100000000), 2, mean)
y.sd <- apply((ppv_dataset$npv/100000000), 2, sd)

x_coordinates <- seq(0,8,by=1)
y_coordinates <- seq(0,8,by=1)

 # Add error bars

arrows(x0=x_coordinates-x.sd, y0=y_coordinates, x1=x_coordinates+x.sd, y1=y_coordinates, code=3, angle=90, length=0.1)
arrows(x0=y_coordinates, y0=x_coordinates-x.sd, x1=y_coordinates, y1=x_coordinates+x.sd, code=3, angle=90, length=0.1)

Any assistance would be wonderful.

Upvotes: 1

Views: 252

Answers (1)

dcarlson

Reputation: 11056

It is a bit hard to follow your example, but this may help. Your example code fails with your sample data: apply() cannot be used on a vector, and you scale your statistics by dividing by 1e8 but not your data.

This may be what you want. Based on the legend in your first figure, the line is npv = NPVF, with NPVF on the x-axis and npv on the y-axis. That means that, for any point, the vertical and horizontal deviations from the line are equal in magnitude but opposite in sign. We can add two columns to your data after scaling the data by 1e8:

ppv_dataset <- ppv_dataset/1e8
ppv_dataset$Diff <- with(ppv_dataset, NPVF - npv)
std <- sd(ppv_dataset$Diff)
ppv_dataset$Z <- ppv_dataset$Diff/std
ppv_dataset
#     NPVF     npv      Diff         Z
# 1 5.3743 4.06867  1.305613  0.697659
# 2 5.9336 6.79311 -0.859492 -0.459273
# 3 2.3933 0.03817  2.355149  1.258482
# 4 5.6419 6.85154 -1.209656 -0.646384
# 5 3.0972 6.77630 -3.679088 -1.965933
# 6 2.8651 4.50007 -1.634954 -0.873644
# 7 2.4079 1.92825  0.479659  0.256307
# 8 4.8410 4.92551 -0.084466 -0.045135

Diff is the difference between NPVF and npv, and Z is Diff divided by the standard deviation of Diff. Your outliers are the rows where the absolute value of Z is greater than 1. Those are the two points (rows 3 and 5) outside the yellow box in your second figure. The following code removes them:

ppv_dataset[abs(ppv_dataset$Z) < 1, ]
#     NPVF    npv      Diff         Z
# 1 5.3743 4.0687  1.305613  0.697659
# 2 5.9336 6.7931 -0.859492 -0.459273
# 4 5.6419 6.8515 -1.209656 -0.646384
# 6 2.8651 4.5001 -1.634954 -0.873644
# 7 2.4079 1.9282  0.479659  0.256307
# 8 4.8410 4.9255 -0.084466 -0.045135
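If you want to try other cutoffs without adding helper columns, the same filter can be wrapped in a small function. This is only a sketch: `trim_by_diff` and its argument `k` are names I made up for illustration, and it assumes the columns are called NPVF and npv as in your sample.

```r
# Hypothetical helper (not part of the code above): keep rows whose
# NPVF - npv difference is less than k standard deviations from zero.
trim_by_diff <- function(df, k = 1) {
  d <- df$NPVF - df$npv                 # signed deviation from the line npv = NPVF
  df[abs(d / sd(d)) < k, , drop = FALSE]
}

# Sample data from the question, scaled by 1e8
raw <- data.frame(
  NPVF = c(537428267.18, 593361648.89, 239331813.71, 564188133.09,
           309720858.48, 286511353.97, 240790667.83, 484104247.40),
  npv  = c(406866996.1019452, 679310854.3856647, 3816961.8569191,
           685153713.2962445, 677629647.0433271, 450006801.2676973,
           192824789.9761059, 492550821.6983585)
) / 1e8

trimmed <- trim_by_diff(raw)         # k = 1 drops rows 3 and 5
loose   <- trim_by_diff(raw, k = 2)  # k = 2 keeps all eight rows here
```

Because Z is a ratio, the result does not depend on whether the data are divided by 1e8 first.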

Here is a simple version of your plot:

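# Logical indices for the rows inside and outside the 1-SD band
notout <- abs(ppv_dataset$Z) < 1
out <- abs(ppv_dataset$Z) > 1
# Inliers in blue and outliers in red, with equal axis scales
plot(ppv_dataset[notout, 1:2], xlim=c(0, 10), ylim=c(0, 10), pch=16, col="blue", asp=1)
points(ppv_dataset[out, 1:2], pch=16, col="red")
# The line npv = NPVF
abline(a=0, b=1)
# Dotted band of +/- 1 SD (measured vertically) around the line
bounds <- cbind(x=c(0, 10, 10, 0), y=c(std, 10+std, 10-std, -std))
polygon(bounds, lty=3)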
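As for the apply() errors: mean() and sd() take a vector directly, so the per-column statistics can be computed as below. This is a minimal sketch that scales the data first, so the statistics and the data are in the same units. The names `raw`, `scaled`, `col_means`, `col_sds` and `inside_box` are mine. The last step is a different criterion from the Z filter above: it keeps points within 1 SD of the mean in both columns, which is only my guess at what the error bars were meant to show.

```r
# Sample data from the question
raw <- data.frame(
  NPVF = c(537428267.18, 593361648.89, 239331813.71, 564188133.09,
           309720858.48, 286511353.97, 240790667.83, 484104247.40),
  npv  = c(406866996.1019452, 679310854.3856647, 3816961.8569191,
           685153713.2962445, 677629647.0433271, 450006801.2676973,
           192824789.9761059, 492550821.6983585)
)
scaled <- raw / 1e8   # scale the data once, before computing any statistics

# mean() and sd() work on a vector; apply() is not needed
x    <- mean(scaled$NPVF)
x.sd <- sd(scaled$NPVF)
y    <- mean(scaled$npv)
y.sd <- sd(scaled$npv)

# The same statistics for every column in one call
col_means <- sapply(scaled, mean)
col_sds   <- sapply(scaled, sd)

# Alternative criterion (an assumption about the intent, not the Z filter):
# keep points within 1 SD of the mean on both axes
inside_box <- abs(scaled$NPVF - x) < x.sd & abs(scaled$npv - y) < y.sd
scaled[inside_box, ]
```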


Upvotes: 1
