user18441

Reputation: 663

How to detect outliers in the columns of a data frame in R?

Suppose I have the following data frame:

names<-c("a","a","a","a","a","b","b","b","b","b","c","c","c","c","c","c","c","c")
var1<-c(0.942999593,0.935507266,0.973589623,0.969415912,0.95230801,0.935507266,0.888740961,0.91750551,0.944482672,0.945468585,1.457579147,0.922206277,0.941511433,0.954724791,0.941014244,0.941511433,0.941511433,1.50511433)
var2<-c(-0.012678088,0.014313763,0.001138275,-0.020568206,0.012987126,0.001217192,0.03360358,0.009758172,0.015066932,-0.037879492,0.020471157,0.010738162,0.010952531,0.019377213,0.027140572,0.031116892,-0.018530676,-8.90E-05)
df <- data.frame(names, var1, var2)  # data.frame() keeps var1 and var2 numeric; cbind() would coerce them to character

I would like to convert the outliers in the columns var1 and var2 to NA. However, I want to identify the outliers independently for each category in the column "names". For example, the outliers for "a" in var1 should be the ones found using only the first 5 rows of var1.

I define an outlier as any value below the 0.25 quantile or above the 0.75 quantile of its group.

Is there any easy way to do this in R?

Thank you very much in advance.

Tina.

Upvotes: 3

Views: 4683

Answers (1)

Theodore Lytras

Reputation: 3963

Here's how you can do it for var1:

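# Quantiles of var1 computed separately for each level of names
quantiles <- tapply(var1, names, quantile)
# For every row, look up the 25% and 75% quantile of that row's group
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
# Set values outside their group's [25%, 75%] range to NA
var1[var1 < minq | var1 > maxq] <- NA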

Repeat the same for var2 (or df$var2).
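To avoid repeating the steps by hand for each column, the same per-group rule can be sketched with base R's ave(), which applies a function within each group and returns a vector aligned with the original rows. This is only a minimal sketch, not part of the original answer: the helper name mask_outside_quartiles is made up for illustration, and the data frame is rebuilt with data.frame() on the assumption that var1 and var2 should be numeric.

```r
# Sample data from the question, rebuilt so that var1 and var2 are numeric
df <- data.frame(
  names = rep(c("a", "b", "c"), times = c(5, 5, 8)),
  var1 = c(0.942999593, 0.935507266, 0.973589623, 0.969415912, 0.95230801,
           0.935507266, 0.888740961, 0.91750551, 0.944482672, 0.945468585,
           1.457579147, 0.922206277, 0.941511433, 0.954724791, 0.941014244,
           0.941511433, 0.941511433, 1.50511433),
  var2 = c(-0.012678088, 0.014313763, 0.001138275, -0.020568206, 0.012987126,
           0.001217192, 0.03360358, 0.009758172, 0.015066932, -0.037879492,
           0.020471157, 0.010738162, 0.010952531, 0.019377213, 0.027140572,
           0.031116892, -0.018530676, -8.90e-05)
)

# Hypothetical helper (name chosen for this example): replace values below
# the 25% quantile or above the 75% quantile of x with NA
mask_outside_quartiles <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  x[x < q[1] | x > q[2]] <- NA
  x
}

# ave() runs the helper separately within each level of df$names
for (col in c("var1", "var2")) {
  df[[col]] <- ave(df[[col]], df$names, FUN = mask_outside_quartiles)
}
```

Note that with this definition of an outlier, roughly half of the values in each group end up as NA, since everything outside the middle 50% is masked.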

Upvotes: 6
