Stelu
Stelu

Reputation: 21

How can I remove outliers (numbers 3 standard deviations away from the mean) in each column of a data frame

I have a dataset with participant IDS and 17 different measures for each participant.

I need to remove outliers- numbers that are 3 standard deviations away from the mean both sides. This needs to happen for each column individually.

So far, by using the code below I have managed to add NA to an outlier column for each column, but it doesn't help me much, since I need to be able to either add NA to the column with the rest of the numbers or simply remove the outlier number

Ideally I want to get to a file that looks like this:

ID measure1 measure2 ....measure17
1  10897                  64436
2  184658    1739473
3            75758
4  746483    4327349      3612638
5  6444      36363        46447

Code I have used so far:

phenotypes <- colnames(imaging_data_kept[,2:ncol(imaging_data_kept)])

 for (i in phenotypes){
  Min <- mean(imaging_data_kept[[i]]) - (3*sd(imaging_data_kept[[i]]))
  Max <- mean(imaging_data_kept[[i]]) + (3*sd(imaging_data_kept[[i]]))  
  imaging_data_kept[[paste0(i,"_outliers")]] <- imaging_data_kept[[i]] < 
  Min | imaging_data_kept[[i]] > Max
 }

Sample data:

SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 4960.2 
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 3943.5 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1160.4 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1 

Upvotes: 2

Views: 5736

Answers (2)

Ben Bolker
Ben Bolker

Reputation: 226577

This will replace values more than 3 SD from the mean with NA:

dd[,-1] <- lapply(dd[,-1],
      function(x) replace(x,abs(scale(x))>3,NA))

(The scale() function computes (x-mean(x))/sd(x); abs(scale(x))>3 should be reasonably self-explanatory; replace() replaces a specified set of indices with the indicated value.)

You can then use na.omit(dd) if you want to drop all rows that contain outliers in any column.

The sample data you gave us doesn't appear to have any outliers (according to your definition) -- I added some.


dd <- read.table(header=TRUE,
                 colClasses=c("character",rep("numeric",5)),
                 text="
SubjID M1 M2 M3 M4 M5 
1000496 14898.1 9172 4902 5921.9 1428.2 
1001121 5420.7 2855.5 4144 732.1 100000
1001468 7478.8 3401.4 5143.6 1106.5 4355.5 
1004960 11316.4 8460.1 3953.4 5682.2 1717 
1005040 15052.7 6362.8 3145.2 4593 1214.5  
1005677 17883.3 6705.1 100000 4993.1 1373.1 
1006128 6260.8 4274.6 5865 2002.3 4727.1 
1006694 9292.8 3389.9 5141.6 1246.6 4135.7 
1009080 10391.3 8372.1 2921.8 4008.6 860.4 
1010482 9381.5 2743.4 4526.5 1000000 3655.1 
1011508 15598.5 7365.7 4279.4 6274.1 1757.1
")

Upvotes: 6

Humpelstielzchen
Humpelstielzchen

Reputation: 6441

I recommend using the boxplot()- function, which calculates outliers. You can acces them in your boxplot-object via boxplot$out or get the quantiles via boxplot$stats. Which is what I'm doing next.

But beware that boxplot does not calculate outliers in terms of 3 standard deviations but with Q1 - 1.5*IQR and Q3 + 1.5*IQR respectively.


library(dplyr) # for the pipe operators

#creating sample data 
df <- data.frame("var1" = c(-20.32, -15.29, rnorm(5,1,1), 11.23, 20.45),
                 "var2" = c(-12.43, -3.12, rnorm(5, 1,1), 10.75, 18.11))

#looks like that
> df
         var1        var2
1 -20.3200000 -12.4300000
2 -15.2900000  -3.1200000
3   0.9950276   1.2645415
4   1.7022687   0.8313770
5   1.8828154  -0.7459769
6   1.2299670   0.5053378
7   0.2749259   2.0239793
8  11.2300000  10.7500000
9  20.4500000  18.1100000

#remove outliers
nooutliers <- lapply(df, function(x) boxplot(df, plot = FALSE)) %>%
                lapply(`[`, "stats") %>% 
                  lapply(range) %>%
                    mapply(function (x,y) !between(x, y[1], y[2]), df, .) %>%
                      as.data.frame %>%
                        mapply(function(x,y) {y[x] <- NA; y},  
                               y = df, x = .)

#looks like this now
> nooutliers
           var1       var2
 [1,]        NA         NA
 [2,]        NA -3.1200000
 [3,] 0.9950276  1.2645415
 [4,] 1.7022687  0.8313770
 [5,] 1.8828154 -0.7459769
 [6,] 1.2299670  0.5053378
 [7,] 0.2749259  2.0239793
 [8,]        NA         NA
 [9,]        NA         NA

This code calculates the range within the whiskers for each column, assigns NA to all values outside of this range and returns a matrix.

I suppose this is what you're looking for.

UPDATE: With 3 standard deviations:

df <- data.frame("var1" = c(-210.32, rnorm(20,1,1), 234.45),
                 "var2" = c(-230.43, rnorm(20, 1,1), 213.11))


phenotypes <- colnames(df)

for (i in phenotypes){
  Min <- mean(df[[i]]) - (3*sd(df[[i]]))
  Max <- mean(df[[i]]) + (3*sd(df[[i]]))  
  df[[i]][df[[i]] < Min | df[[i]] > Max] <- NA}

This adopts your outlier definition.

Upvotes: 3

Related Questions