user97878

Reputation: 53

How do I make a function in R to check for data errors?

I have a lot of csv files of temperature data which I am importing into R to process. These files look like:

ID   Date.Time          temp1    temp2
1    08/13/17 14:48:18  15.581  -0.423
2    08/13/17 16:48:18  17.510  -0.423
3    08/13/17 18:48:18  15.390  -0.423

Sometimes the temperature readings in columns 3 and 4 are clearly wrong and have to be replaced with NA values. I know that anything over 50 or under -50 is an error. I'd like to just remove these right away. Using

df[, c(3, 4)] <- replace(df[, c(3, 4)], df[, c(3, 4)] > 50, NA)
df[, c(3, 4)] <- replace(df[, c(3, 4)], df[, c(3, 4)] < -50, NA)

works but I don't really want to have to repeat this for every file because it seems messy.

I would like to make a function to replace all this like:

df<-remove.errors(df[,c(3,4)])

I've tried:

remove.errors<-function (df) {
  df[,]<- replace(df[,], df[,] > 50, NA)
  df[,]<- replace(df[,], df[,] < -50, NA)
  }

df<-remove.errors(df[,c(3,4)])

This works but unfortunately only keeps the 3rd and 4th columns and the first two disappear. I've played around with this code for far too long and tried some other things which didn't work at all.

I know I'm probably missing something basic. Anyone have any tips on making a function which will replace values in columns 3 and 4 with NAs without changing the first two columns?

Upvotes: 3

Views: 677

Answers (3)

G. Grothendieck

Reputation: 269644

1) Try this. It uses only base R.

clean <- function(x, max = 50, min = -max) replace(x, x > max | x < min, NA)
df[3:4] <- clean(df[3:4])

1a) Alternatively, we could do this (which returns a cleaned copy and does not overwrite df):

transform(df, temp1 = clean(temp1), temp2 = clean(temp2))

2) Adding in magrittr we could do this:

library(magrittr)
df[3:4] %<>% { clean(.) }

3) In dplyr we could do this:

library(dplyr)

# across() supersedes mutate_at() in dplyr >= 1.0.0
df %>% mutate(across(3:4, clean))
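
As a quick check of approach 1), here is a minimal sketch on made-up data shaped like the question's files. The out-of-range values (999 and -75) are invented for illustration; the point is that assigning back into df[3:4] leaves the first two columns alone.

```r
# clean() as defined in 1); min defaults to -max through lazy evaluation
clean <- function(x, max = 50, min = -max) replace(x, x > max | x < min, NA)

# Made-up sample in the shape of the question's files, with two bad readings
df <- data.frame(
  ID = 1:3,
  Date.Time = c("08/13/17 14:48:18", "08/13/17 16:48:18", "08/13/17 18:48:18"),
  temp1 = c(15.581, 999.000, 15.390),
  temp2 = c(-0.423, -0.423, -75.000)
)

# Assign back into the same columns so ID and Date.Time are left untouched
df[3:4] <- clean(df[3:4])
print(df)
```

The limits can be changed per call, for example clean(df[3:4], max = 40), in which case min follows as -40 unless it is given explicitly.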

Upvotes: 3

Maurits Evers

Reputation: 50678

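You need to return df in remove.errors, and assign the result back into columns 3 and 4 (df[3:4] <- remove.errors(df[3:4])) rather than into df itself, otherwise the first two columns are lost. You can also write the replace statement more succinctly using abs: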

remove.errors<-function (df) {
    df[]<- replace(df, abs(df) > 50, NA)
    return(df)
}

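Or a cleaner/safer alternative using dplyr that only touches numeric columns and leaves non-numeric ones alone:

library(dplyr)
# funs() is deprecated and mutate_if() is superseded; across() needs dplyr >= 1.0.0
df %>% mutate(across(where(is.numeric), ~ replace(.x, abs(.x) > 50, NA)))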
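
One caveat when selecting by is.numeric on data shaped like the question's: the ID column is numeric too, so any ID above 50 would also be set to NA. Here is a minimal sketch in base R, on made-up data, comparing that with passing only the temperature columns. The names dat, all_numeric, and temps_only are just for illustration.

```r
remove.errors <- function(df) {
  df[] <- replace(df, abs(df) > 50, NA)
  df
}

# Made-up data: IDs run past 50, and each temp column has one out-of-range reading
dat <- data.frame(
  ID = c(49, 50, 51),
  Date.Time = c("08/13/17 14:48:18", "08/13/17 16:48:18", "08/13/17 18:48:18"),
  temp1 = c(15.581, 120.000, 15.390),
  temp2 = c(-0.423, -0.423, -60.000)
)

# Applying the rule to every numeric column also blanks ID 51
numeric_cols <- vapply(dat, is.numeric, logical(1))
all_numeric <- dat
all_numeric[numeric_cols] <- remove.errors(dat[numeric_cols])

# Passing only columns 3 and 4, and assigning back into them, keeps ID intact
temps_only <- dat
temps_only[3:4] <- remove.errors(dat[3:4])
```

If the IDs in your files never exceed 50 the two approaches give the same result, but selecting the temperature columns explicitly avoids relying on that.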

Upvotes: 2

DanY

Reputation: 6073

In case you have non-numeric columns in your data.frame, you might want this:

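remove_errors <- function(df) {
    # Logical vector marking which columns are numeric
    numcols <- vapply(df, is.numeric, logical(1))
    # Index as df[numcols], not df[, numcols], so a single numeric column is not dropped to a vector
    df[numcols] <- lapply(df[numcols], function(x) ifelse(abs(x) > 50, NA, x))
    return(df)
}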

Here's a test:

set.seed(1234)
mydf <- data.frame(
    a = sample(-100:100, 20, replace = TRUE),
    b = sample(30:70, 20, replace = TRUE),
    c = sample(letters, 20, replace = TRUE),
    stringsAsFactors = FALSE
)

remove_errors(mydf)
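
Since the question involves many csv files, here is a minimal sketch of running the same replacement over every file in a folder. The read_clean helper, the folder, and the file names are hypothetical. The sketch writes two small made-up files to a temporary directory so that it runs on its own, and it limits the replacement to columns 3 and 4 so that a numeric ID column is not affected.

```r
# Hypothetical helper: read one file and blank out-of-range readings in `cols`
read_clean <- function(path, cols = 3:4, limit = 50) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  df[cols] <- lapply(df[cols], function(x) ifelse(abs(x) > limit, NA, x))
  df
}

# Write two made-up files so the example is self-contained
demo_dir <- file.path(tempdir(), "temperature_demo")
dir.create(demo_dir, showWarnings = FALSE)
sample_data <- data.frame(
  ID = 1:2,
  Date.Time = c("08/13/17 14:48:18", "08/13/17 16:48:18"),
  temp1 = c(15.581, 250.000),
  temp2 = c(-0.423, -99.000)
)
write.csv(sample_data, file.path(demo_dir, "site_a.csv"), row.names = FALSE)
write.csv(sample_data, file.path(demo_dir, "site_b.csv"), row.names = FALSE)

# Clean every csv in the folder; the result is a list of data frames named by file
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
cleaned <- lapply(files, read_clean)
names(cleaned) <- basename(files)
```

Each element of cleaned is one cleaned data frame, for example cleaned[["site_a.csv"]].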

Upvotes: 2
