D. Studer

Reputation: 1875

R: create a function in tidyverse

I have some fake data:

library(tidyverse)
df <- data.frame(id = 1:20,
                 var1 = sample(c(0,1), size = 20, replace = T),
                 var2 = round(runif(20, min = 0, max = 100),0),
                 var3 = round(runif(20, min = 0, max = 100),0),
                 var4 = round(rnorm(20, mean = 50, sd = 20)),
                 var5 = sample(c(1:19, NA), size=20))

Then I would like to run some checks on these data. The IDs of the rows that fail a check, together with an error message, should be collected in a data.frame called errors. I would like to call the function using the pipe operator %>%.

### Different checks

# There should be no missing values in var5
df %>% filter(is.na(var5)) %>% add_errors("There are NAs in var5")

# var3 should be greater than var4
df %>% filter(var3 < var4) %>% add_errors("var3 is smaller than var4")

# ... etc.

Then I have to define the function add_errors():

### Define function

errors <- data.frame(id = numeric(), errormessage = character())

add_errors <- function(dat, error){
    errors <<- add_case(errors, id = dat[['id']], errormessage = error)
}

Upvotes: 1

Views: 202

Answers (3)

TimTeaFan

Reputation: 18581

I know that this question is about creating a custom function to check for errors. But there is a nice package called {pointblank} which is made for exactly this kind of task.

Instead of setting up a data.frame called errors, we can set up a so-called "agent" and "interrogate" it to get a nice report. The package's website describes several alternative workflows for checking data. Below is one possible way to use the package on your problem.

library(dplyr)
library(pointblank)

df <- data.frame(id = 1:20,
                 var1 = sample(c(0,1), size = 20, replace = T),
                 var2 = round(runif(20, min = 0, max = 100),0),
                 var3 = round(runif(20, min = 0, max = 100),0),
                 var4 = round(rnorm(20, mean = 50, sd = 20)),
                 var5 = sample(c(1:19, NA), size=20))
agent <- df %>%
  create_agent(
    label = "My error checks",
    actions = action_levels(stop_at = 1)
  ) %>%
  col_vals_not_null(var5) %>% 
  col_vals_not_in_set(
    vars(var3_gt_var4),
    # TRUE where var3 > var4; rows where it is FALSE fail the check
    preconditions = ~ . %>% dplyr::mutate(var3_gt_var4 = var3 > var4),
    set = FALSE) %>% 
  interrogate()
  
agent
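
If you still want a data.frame of failing ids like the errors table in the question, the failing rows can be pulled back out of the agent. This is a minimal sketch, assuming interrogate() keeps its default row extracts. Note that I swapped in col_vals_gt() for the precondition used above to keep the example short, and the step column name is my own choice:

```r
library(dplyr)
library(pointblank)

set.seed(123)  # seed added only to make the sketch reproducible
df <- data.frame(id = 1:20,
                 var3 = round(runif(20, min = 0, max = 100), 0),
                 var4 = round(rnorm(20, mean = 50, sd = 20)),
                 var5 = sample(c(1:19, NA), size = 20))

agent <- df %>%
  create_agent(label = "My error checks") %>%
  col_vals_not_null(var5) %>%
  col_vals_gt(var3, value = vars(var4)) %>%  # var3 must be greater than var4
  interrogate()

# get_data_extracts() returns a list with the failing rows of each failing step
extracts <- get_data_extracts(agent)

# Keep only the id column and record which validation step each row failed
errors <- bind_rows(lapply(extracts, function(x) x["id"]), .id = "step")
errors
```

The step column only holds the index of the validation step, so you would still have to map it to a message of your choice.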

Upvotes: 1

akrun

Reputation: 887941

We could either print the error message to the console:

add_errors <- function(dat, error) {
    glue::glue("{error} at id: {toString(dat[['id']])}")
}

Testing:

df %>%
    filter(is.na(var5)) %>% 
    add_errors("There are NAs in var5")
#There are NAs in var5 at id: 6

df %>%
   filter(var3 < var4) %>%
   add_errors("var3 is smaller than var4")
#var3 is smaller than var4 at id: 1, 2, 3, 4, 6, 7, 8, 11, 15, 16, 17, 20

Or return a tibble/data.frame with the error message as output:

add_errors <- function(dat, error) {
    tibble(id = dat[['id']], errormessage = error)
}

df %>%
     filter(is.na(var5)) %>% 
     add_errors("There are NAs in var5")
# A tibble: 1 x 2
#     id errormessage         
#  <int> <chr>                
#1     6 There are NAs in var5
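
Since this version returns a tibble instead of modifying a global object, the results of several checks can be stacked into one errors table with bind_rows(). A minimal sketch, assuming the same fake data as in the question (the seed is my addition, only for reproducibility):

```r
library(dplyr)

set.seed(42)  # seed added only to make the sketch reproducible
df <- data.frame(id = 1:20,
                 var3 = round(runif(20, min = 0, max = 100), 0),
                 var4 = round(rnorm(20, mean = 50, sd = 20)),
                 var5 = sample(c(1:19, NA), size = 20))

# Returns one row per failing id; a check with no failures yields a 0-row tibble
add_errors <- function(dat, error) {
    tibble(id = dat[["id"]], errormessage = error)
}

# Run each check separately and stack the results
errors <- bind_rows(
    df %>% filter(is.na(var5)) %>% add_errors("There are NAs in var5"),
    df %>% filter(var3 < var4) %>% add_errors("var3 is smaller than var4")
)
errors
```

This keeps add_errors() free of side effects, so the order in which the checks run does not matter.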

Another option is the logger package, which makes it easy to log errors, warnings, info messages etc., each with a timestamp:

#remotes::install_github('daroczig/logger')
library(logger)
log_layout(layout_glue_colors)
t <- tempfile()
log_appender(appender_file(t))
log_info('Script starting up...')

# log only when the filter actually returns offending rows
df %>%
    filter(is.na(var5)) %>%
    {if (nrow(.) > 0) log_error('There are NAs in var5')}

df %>%
    filter(var3 < var4) %>%
    {if (nrow(.) > 0) log_error("var3 is smaller than var4")}
cat(readLines(t), sep="\n")
#INFO [2021-02-28 14:28:42] Script starting up...
#ERROR [2021-02-28 14:28:42] There are NAs in var5
#ERROR [2021-02-28 14:28:43] var3 is smaller than var4

unlink(t)

Here t is a temporary file; the log can just as well be written to a file in a custom destination folder.
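
To include the offending ids in the log and write to a fixed location, something along these lines should work. It is only a sketch: log_errors() and the file name data_checks.log are hypothetical names of mine, and tempdir() stands in for whatever folder you want to use.

```r
library(dplyr)
library(logger)

set.seed(123)  # seed added only to make the sketch reproducible
df <- data.frame(id = 1:20,
                 var3 = round(runif(20, min = 0, max = 100), 0),
                 var4 = round(rnorm(20, mean = 50, sd = 20)),
                 var5 = sample(c(1:19, NA), size = 20))

# Hypothetical destination; replace tempdir() with your own log folder
log_file <- file.path(tempdir(), "data_checks.log")
log_appender(appender_file(log_file))

# Hypothetical helper: write one ERROR line listing the offending ids, if any
log_errors <- function(dat, error) {
    if (nrow(dat) > 0) {
        ids <- toString(dat[["id"]])
        log_error("{error} at id: {ids}")
    }
    invisible(dat)
}

df %>% filter(is.na(var5)) %>% log_errors("There are NAs in var5")
df %>% filter(var3 < var4) %>% log_errors("var3 is smaller than var4")

cat(readLines(log_file), sep = "\n")
```

The helper returns its input invisibly, so further steps can be chained after it in the same pipe.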

Upvotes: 1

Vons

Reputation: 3335

The following code does something similar to what you are asking. I tried doing it without passing the errors data frame as an argument, but it doesn't end up changing the errors variable outside of the function.

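errors <- data.frame(id = numeric(), errormessage = character())
# Append the failing ids to the global errors and return the combined data frame
add_errors <- function(df, errormessage) {
    bind_rows(errors, data.frame(id = df$id, errormessage = errormessage))
}
errors <- df %>% filter(is.na(var5)) %>% add_errors("There are NAs in var5")
# var3 should be greater than var4, so rows with var3 < var4 are the errors
errors <- df %>% filter(var3 < var4) %>% add_errors("var3 is smaller than var4")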

Output:

> print(errors)
  id              errormessage
1  3     There are NAs in var5
2  2 var3 is smaller than var4
3  3 var3 is smaller than var4
4  7 var3 is smaller than var4
5  8 var3 is smaller than var4
6  9 var3 is smaller than var4
7 12 var3 is smaller than var4
8 16 var3 is smaller than var4
9 18 var3 is smaller than var4

Upvotes: 1
