Reputation: 641
I'm so new to R that I'm having trouble finding what I need in other people's questions. I think my question is so easy that nobody else has bothered to ask it.
What would be the simplest code to create a new data frame that excludes univariate outliers on a certain variable, within each condition? I'm defining outliers as points more than 3 SDs from their condition's mean.
I'm embarrassed to show what I've tried, but here it is:
greaterthan <- mean(dat$var2[dat$condition=="one"]) +
  2.5*(sd(dat$var2[dat$condition=="one"]))
lessthan <- mean(dat$var2[dat$condition=="one"]) -
  2.5*(sd(dat$var2[dat$condition=="one"]))
withoutliersremovedone1 <- dat$var2[dat$condition=="one"] < greaterthan
and I'm pretty much already stuck there.
Thanks
Upvotes: 5
Views: 5307
Reputation: 93908
> dat <- data.frame(
    var1=sample(letters[1:2],10,replace=TRUE),
    var2=c(1,2,3,1,2,3,102,3,1,2)
  )
> dat
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
7 a 102 #outlier
8 b 3
9 b 1
10 a 2
Now return only those rows which are not (!) more than 2 absolute sd's from the mean of the variable in question. Obviously, change 2 to however many sd's you want as the cutoff.
> dat[!(abs(dat$var2 - mean(dat$var2)) / sd(dat$var2) > 2), ]
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3 # no outlier
8 b 3 # between here
9 b 1
10 a 2
Or, as a shorthand, using the scale function:
> dat[!(abs(scale(dat$var2)) > 2), ]
var1 var2
1 b 1
2 a 2
3 a 3
4 a 1
5 b 2
6 b 3
8 b 3
9 b 1
10 a 2
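As a quick check that the two filters above select the same rows, here is a small sketch using the same var2 values. The var1 labels are typed out by hand (copied from the printed output) because sample() is not seeded in the original, and the names keep_manual, keep_scale and dat_clean are illustrative only.

```r
# Same data as printed above, with var1 typed out so the example is reproducible
dat <- data.frame(
  var1 = c("b", "a", "a", "a", "b", "b", "a", "b", "b", "a"),
  var2 = c(1, 2, 3, 1, 2, 3, 102, 3, 1, 2)
)

# Manual z-score: distance from the mean, in units of sd
keep_manual <- abs(dat$var2 - mean(dat$var2)) / sd(dat$var2) <= 2

# scale() returns a one-column matrix of z-scores;
# as.vector() drops the matrix shape so it can be compared directly
keep_scale <- as.vector(abs(scale(dat$var2))) <= 2

# Both logical vectors flag the same rows
identical(keep_manual, keep_scale)

# Keep only the rows within the cutoff
dat_clean <- dat[keep_scale, ]
```

Writing the condition as `<= 2` instead of `!(... > 2)` is only a readability choice; both keep the same rows here.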
edit
This can be extended to looking within groups using by:
do.call(rbind, by(dat, dat$var1, function(x) x[!(abs(scale(x$var2)) > 2), ]))
This assumes dat$var1 is your variable defining the group each row belongs to.
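For the question's own setup (a condition column, a var2 column and a 3 SD cutoff), one possible alternative to by() is base R's ave(), which computes a per-group value for every row and keeps the original row order. This is only a sketch: the data below are made up for illustration, and cutoff, z and dat_no_outliers are hypothetical names.

```r
# Made-up data using the question's column names
dat <- data.frame(
  condition = rep(c("one", "two"), each = 20),
  var2 = c(rep(c(1, 2, 3), length.out = 19), 102,  # 102 is an outlier within "one"
           rep(c(10, 11, 12), length.out = 20))    # "two" has no outlier
)

cutoff <- 3  # number of SDs, as defined in the question

# ave() applies FUN within each condition and returns a vector
# aligned with the rows of dat: here, each row's absolute z-score
# relative to its own condition's mean and sd
z <- ave(dat$var2, dat$condition,
         FUN = function(v) abs(v - mean(v)) / sd(v))

# Keep rows at or within the cutoff for their condition
dat_no_outliers <- dat[z <= cutoff, ]
```

Note that with small groups a 3 SD cutoff may never trigger, and a condition with a single observation has an undefined sd, so its z-score would be NA and would need separate handling.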
Upvotes: 8
Reputation: 368409
I use the winsorize() function in the robustHD package for this task. Here is its example:
R> example(winsorize)
winsrzR> ## generate data
winsrzR> set.seed(1234) # for reproducibility
winsrzR> x <- rnorm(10) # standard normal
winsrzR> x[1] <- x[1] * 10 # introduce outlier
winsrzR> ## winsorize data
winsrzR> x
[1] -12.070657 0.277429 1.084441 -2.345698 0.429125 0.506056
[7] -0.574740 -0.546632 -0.564452 -0.890038
winsrzR> winsorize(x)
[1] -3.250372 0.277429 1.084441 -2.345698 0.429125 0.506056
[7] -0.574740 -0.546632 -0.564452 -0.890038
winsrzR>
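This defaults to median +/- 2 mad, but you can set the parameters for mean +/- 3 sd. Note that winsorizing pulls an outlying value in to the boundary (as with the first element above) rather than removing it.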
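To illustrate the idea of clamping to mean +/- 3 sd without relying on any package, here is a minimal base R sketch. The helper clamp_sd is hypothetical and is not the robustHD implementation; it only shows the general technique on made-up data.

```r
# Hypothetical helper (not robustHD's code):
# clamp values to mean +/- k * sd instead of dropping them
clamp_sd <- function(x, k = 3) {
  lo <- mean(x) - k * sd(x)
  hi <- mean(x) + k * sd(x)
  pmin(pmax(x, lo), hi)  # raise values below lo, lower values above hi
}

# Made-up data: 19 ordinary values plus one extreme value
x <- c(rep(c(1, 2, 3), length.out = 19), 102)

x_clamped <- clamp_sd(x)
```

The result has the same length as x; only the extreme value is changed, which is the main difference from the row-dropping approaches in the other answer.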
Upvotes: 4