Reputation: 109
I have a data frame with two columns of variables (a mix of validation and testing data). I calculated the standard deviation for both columns, and now I want to trim the data to remove points outside of the error bars.
How can I remove the points outside of the 'yellow area', i.e. those that do not lie within 1 standard deviation of the mean?
To illustrate my problem, this is a small part of the data frame, including the functions used so far.
ppv_dataset <- data.frame(NPVF=c(537428267.18, 593361648.89, 239331813.71, 564188133.09, 309720858.48, 286511353.97, 240790667.83, 484104247.40),
npv=c(406866996.1019452, 679310854.3856647, 3816961.8569191, 685153713.2962445, 677629647.0433271, 450006801.2676973, 192824789.9761059, 492550821.6983585))
x <- apply((ppv_dataset$NPVF/100000000), 2, mean)
x.sd <- apply((ppv_dataset$NPVF/100000000), 2, sd)
y <- apply((ppv_dataset$npv/100000000), 2, mean)
y.sd <- apply((ppv_dataset$npv/100000000), 2, sd)
x_coordinates <- seq(0,8,by=1)
y_coordinates <- seq(0,8,by=1)
# Add error bars
arrows(x0=x_coordinates-x.sd, y0=y_coordinates, x1=x_coordinates+x.sd, y1=y_coordinates, code=3, angle=90, length=0.1)
arrows(x0=y_coordinates, y0=x_coordinates-x.sd, x1=y_coordinates, y1=x_coordinates+x.sd, code=3, angle=90, length=0.1)
Any assistance would be wonderful.
Upvotes: 1
Views: 252
Reputation: 11056
It is a bit hard to follow your example, but this may help. None of your example code runs with your sample data: apply cannot be used with a vector, and you scale your statistics by dividing by 1e8 but not your data. This may be what you want. Based on the legend in your first figure, the line is npv = NPVF, with NPVF on the x-axis and npv on the y-axis. That means that, for any point, the vertical and horizontal deviations from the line are equal in magnitude but opposite in sign. We can add two columns to your data after scaling it by 1e8:
ppv_dataset <- ppv_dataset/1e8
ppv_dataset$Diff <- with(ppv_dataset, NPVF - npv)
std <- sd(ppv_dataset$Diff)
ppv_dataset$Z <- ppv_dataset$Diff/std
ppv_dataset
#     NPVF     npv      Diff         Z
# 1 5.3743 4.06867  1.305613  0.697659
# 2 5.9336 6.79311 -0.859492 -0.459273
# 3 2.3933 0.03817  2.355149  1.258482
# 4 5.6419 6.85154 -1.209656 -0.646384
# 5 3.0972 6.77630 -3.679088 -1.965933
# 6 2.8651 4.50007 -1.634954 -0.873644
# 7 2.4079 1.92825  0.479659  0.256307
# 8 4.8410 4.92551 -0.084466 -0.045135
Diff is the difference between NPVF and npv, and Z is Diff divided by its standard deviation. Your outliers are the rows where the absolute value of Z is greater than 1. Those are the two points (rows 3 and 5) outside the yellow box in your second figure. The following code removes them:
ppv_dataset[abs(ppv_dataset$Z) < 1, ]
#     NPVF    npv      Diff         Z
# 1 5.3743 4.0687  1.305613  0.697659
# 2 5.9336 6.7931 -0.859492 -0.459273
# 4 5.6419 6.8515 -1.209656 -0.646384
# 6 2.8651 4.5001 -1.634954 -0.873644
# 7 2.4079 1.9282  0.479659  0.256307
# 8 4.8410 4.9255 -0.084466 -0.045135
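If you need to repeat this on other data, the same filter could be wrapped in a small function. This is only a sketch: trim_by_diff and its threshold argument are names made up here for illustration, and it assumes the same definition of Z as above (the difference divided by its standard deviation, without centring on the mean).

```r
# Hypothetical helper, not part of base R: keep the rows whose scaled
# difference between two columns is below `threshold` in absolute value.
trim_by_diff <- function(df, x_col, y_col, threshold = 1) {
  diff <- df[[x_col]] - df[[y_col]]
  z <- diff / sd(diff)  # same scaling as the Z column: not mean-centred
  df[abs(z) < threshold, , drop = FALSE]
}

# Sample data from the question, scaled by 1e8 as in the answer
ppv_raw <- data.frame(
  NPVF = c(537428267.18, 593361648.89, 239331813.71, 564188133.09,
           309720858.48, 286511353.97, 240790667.83, 484104247.40),
  npv  = c(406866996.1019452, 679310854.3856647, 3816961.8569191,
           685153713.2962445, 677629647.0433271, 450006801.2676973,
           192824789.9761059, 492550821.6983585)
)
trimmed <- trim_by_diff(ppv_raw / 1e8, "NPVF", "npv")
rownames(trimmed)  # rows 3 and 5 are dropped, as in the output above
```

Because Z is a ratio, dividing the data by 1e8 does not change which rows are kept; the scaling only matters for the plot axes.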
Here is a simple version of your plot:
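# Logical indices for points inside and outside the 1-standard-deviation band
notout <- abs(ppv_dataset$Z) < 1
out <- abs(ppv_dataset$Z) > 1
# Plot the retained points in blue and the outliers in red
plot(ppv_dataset[notout, 1:2], xlim=c(0, 10), ylim=c(0, 10), pch=16, col="blue", asp=1)
points(ppv_dataset[out, 1:2], pch=16, col="red")
# Reference line npv = NPVF
abline(a=0, b=1)
# Band of one standard deviation (of Diff) above and below the line
bounds <- cbind(x=c(0, 10, 10, 0), y=c(std, 10+std, 10-std, -std))
polygon(bounds, lty=3)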
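As a side note on the apply error mentioned earlier: a single column such as ppv_dataset$NPVF is a plain vector with no dimensions, so apply has no margin to iterate over. A minimal sketch of how the per-column means and standard deviations could be computed instead, assuming that is what the original code was aiming for (ppv_scaled, col_means, col_sds and the other names are made up for this example):

```r
# Sample data from the question, scaled by 1e8 (only the two original columns)
ppv_scaled <- data.frame(
  NPVF = c(537428267.18, 593361648.89, 239331813.71, 564188133.09,
           309720858.48, 286511353.97, 240790667.83, 484104247.40),
  npv  = c(406866996.1019452, 679310854.3856647, 3816961.8569191,
           685153713.2962445, 677629647.0433271, 450006801.2676973,
           192824789.9761059, 492550821.6983585)
) / 1e8

# On a single column, call mean() and sd() directly
x    <- mean(ppv_scaled$NPVF)
x.sd <- sd(ppv_scaled$NPVF)

# For all columns at once, sapply() loops over the columns of the data frame
col_means <- sapply(ppv_scaled, mean)
col_sds   <- sapply(ppv_scaled, sd)

# apply() works on the whole data frame because it has two dimensions ...
col_means_apply <- apply(ppv_scaled, 2, mean)

# ... but fails on a bare vector, which has no dim attribute
vector_result <- tryCatch(apply(ppv_scaled$NPVF, 2, mean),
                          error = function(e) e)
inherits(vector_result, "error")
```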
Upvotes: 1