millie0725
millie0725

Reputation: 371

Removing outliers from multiple columns for dataframes in a list

I have a list of dataframes, each with multiple columns that contain outliers that I would like to remove and replace with NA. My dataset is quite large (each dataframe has 11 columns with ~15,000 rows), so I tried my best to create a reproducible example below:

df1 <- data.frame(date_time = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08", "2019-01-09", "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-13", "2019-01-14", "2019-01-15","2019-01-16","2019-01-17"),
                  XH_warmed_air_1m = c(25, 23, 26, 30, 10, 15, 12, 0, 1, 5, -15, -12, -6, -1, 537, 435, 300),
                  XH_ambient_air_1m = c(25, 23, 26, 30, 10, 15, 12, 0, 1, 5, -15, -12, -6, -1, 537, 435, 300))
df2 <- data.frame(date_time = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08", "2019-01-09", "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-13", "2019-01-14", "2019-01-15","2019-01-16","2019-01-17"),
                  XH_warmed_air_1m = c(25, 23, 26, 30, 10, 15, 12, 0, 1, 5, -15, -12, -6, -1, 537, 435, 300),
                  XH_ambient_air_1m = c(25, 23, 26, 30, 10, 15, 12, 0, 1, 5, -15, -12, -6, -1, 537, 435, 300))
df3 <- data.frame(date_time = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05", "2019-01-06", "2019-01-07", "2019-01-08", "2019-01-09", "2019-01-10", "2019-01-11", "2019-01-12", "2019-01-13", "2019-01-14", "2019-01-15","2019-01-16","2019-01-17"),
                  XH_warmed_air_1m = c(25, 23, 26, 30, 10, 15, 12, 0, 1, 5, -15, -12, -6, -1, 537, 435, 300),
                  XH_ambient_air_1m = c(25, 23, 26, 30, 10, 15, 12, 0, 1, 5, -15, -12, -6, -1, 537, 435, 300))

list_df <- list(df1=df1, df2=df2, df3=df3)

I would like to make a function that replaces the outliers from every column that are 3 sd away from the mean with NA. In the given example, the outliers are 537, 435, and 300, but my actual data has a range of outliers. Below is a function I found here that I tried to use for this purpose.

remove_outliers <- function(df){
  columns <- colnames(df)
  for (i in columns){
    Min <- mean(df[[i]]) - (3*sd(df[[i]]))
    Max <- mean(df[[i]]) + (3*sd(df[[i]]))  
    df[[i]][df[[i]] < Min | df[[i]] > Max] <- NA
  }
  return(df)
}

list_df <- lapply(list_df, remove_outliers)

When I try to apply the function to the list, it doesn't seem to do anything. How can I fix this function so that all columns (except the date_time column) in each of the dataframes in the list have their outliers removed?

Using R version 3.5.1, Mac OS X 10.13.6

Upvotes: 0

Views: 679

Answers (1)

Haci Duru
Haci Duru

Reputation: 456

I think @Duck's comment is very useful here. When you calculate the mean and the sd using the entire dataset, you are including the outliers to the calculation. That will not remove the three outliers in your example. You should restrict your data somehow before calculating the mean and the sd, and then based on those calculations, you can remove outliers. That is, you should remove some cases from the high/low end of the range. The question is, how many (or what proportion of) the cases you will exclude before calculating the mean and the sd? There, you can use the quantile function. Here is how I modified your function:

remove_outliers = function(df) {
    for (i in 2:ncol(df)) {
        dat = df[which(df[,i] > quantile(df[,i], .1) & df[,i] < quantile(df[,i], .9)),i]
        mean = mean(dat)
        sd = sd(dat)
        df[which(   abs((df[,i]) - mean) > (sd * 3)), i] = NA
    }
    return(df)
}

And here is the outcome when you apply that function to df1:

> remove_outliers(df1)
    date_time XH_warmed_air_1m XH_ambient_air_1m
1  2019-01-01               25                25
2  2019-01-02               23                23
3  2019-01-03               26                26
4  2019-01-04               30                30
5  2019-01-05               10                10
6  2019-01-06               15                15
7  2019-01-07               12                12
8  2019-01-08                0                 0
9  2019-01-09                1                 1
10 2019-01-10                5                 5
11 2019-01-11              -15               -15
12 2019-01-12              -12               -12
13 2019-01-13               -6                -6
14 2019-01-14               -1                -1
15 2019-01-15               NA                NA
16 2019-01-16               NA                NA
17 2019-01-17               NA                NA

Also, as @dcarlson said, you are applying the function to a date_time column. I excluded that column from the function.

Upvotes: 1

Related Questions