D. Studer

Reputation: 1875

Define a tidyverse-function

I have a data.frame df and I would like to run some checks on the data. If a check finds an error (e.g. missing or implausible values), I would like to build a list containing the id of the case and the type of error.

library(dplyr)

# Define an empty data.frame
errors <- data.frame(id = numeric(),
                     message = character())

# Function that stacks all the errors
addErrorMessage <- function(data, message) {
  
  errors <- rbind(errors,   )  # <= what to do here?
  
}

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))


########### List of checks ################
# Check 1: var1 should be smaller than 5
df %>% filter(var1 >= 5) %>%
  addErrorMessage(message = "Value of var1 is 5 or greater")

# Check 2: var2 should not be missing
df %>% filter(is.na(var2)) %>%
  addErrorMessage(message = "Value of var2 is missing")

My question is: how can I define a function addErrorMessage() that I can use directly in a tidyverse workflow? I want to avoid saving the offending cases to a temporary data.frame for each check and then stacking that data.frame onto the errors data.frame with rbind().

Upvotes: 3

Views: 99

Answers (1)

TimTeaFan

Reputation: 18581

Your actual problem can probably be solved using the {pointblank} package which contains a lot of functions that help to conduct this and similar tests.

If you would rather write such validation functions yourself, see the very rough draft at the end of this answer.

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))


library(pointblank)

df %>% 
  col_vals_lt(vars(var1),
              value = 5) %>% 
  col_vals_not_null(vars(var2))
#> Error: Exceedance of failed test units where values in `var1` should have been < `5`.
#> The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)

Created on 2021-08-17 by the reprex package (v2.0.1)

{pointblank} can also generate data validation reports:

agent <- 
  create_agent(
    tbl = df,
    tbl_name = "My data",
    label = "Checking column values",
    actions = action_levels(stop_at = 1)
  ) %>%
  col_vals_lt(vars(var1),
              value = 5) %>% 
  col_vals_not_null(vars(var2)) %>% 
  interrogate()

agent

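To get from such an agent to a table of offending cases like the one asked for in the question, the failing rows can be pulled out of it. The following is only a minimal sketch, assuming that `get_data_extracts()` returns one table of failing rows per validation step (as the {pointblank} documentation describes) and that stacking them with `dplyr::bind_rows()` is an acceptable format; the names `extracts` and `errors` are made up for illustration.

```r
library(pointblank)

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

agent <-
  create_agent(
    tbl = df,
    tbl_name = "My data",
    label = "Checking column values"
  ) %>%
  col_vals_lt(vars(var1), value = 5) %>%
  col_vals_not_null(vars(var2)) %>%
  interrogate()

# List with one table of failing rows for each validation step that had failures
extracts <- get_data_extracts(agent)

# Stack the extracts; the `step` column records which validation step a row failed
errors <- dplyr::bind_rows(extracts, .id = "step")
errors
```

The `id` column of `errors` then identifies the case, and `step` identifies the check it failed.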

If you would rather write this kind of function yourself, below is a very rough draft. It stores the errors in an attribute of the underlying data.frame, which is not a great solution: depending on the functions you use between checks, the attribute might get lost. In a package we could use a dedicated environment to capture the errors, in which case we wouldn't need the attribute.

library(dplyr)

df <- data.frame(id = 1:7,
                 var1 = c(10, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", NA, "B", "C", NA, "D", "A"))


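check <- function(data, condition, message) {
  
  # Capture the unevaluated condition and evaluate it against the columns of `data`
  cond <- rlang::enquo(condition)
  test <- transmute(data, new = !!cond)$new
  
  # Only record something if at least one row meets the error condition
  # (na.rm = TRUE so that NAs in `test` do not break the `if`)
  if (any(test, na.rm = TRUE)) {
    err_df <- attr(data, "error_df")
    if (is.null(err_df)) {
      # First failed check: create the error table
      attr(data, "error_df") <- data.frame(check   = 1L,
                                           row_nr  = which(test),
                                           message = message)
    } else {
      # Later failed checks: append rows under the next check number
      attr(data, "error_df") <- rbind(err_df,
                                      data.frame(check   = max(err_df$check) + 1L,
                                                 row_nr  = which(test),
                                                 message = message)
      )
    }
  }
  # Return the data so that further checks can be chained
  data
}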

get_errors <- function(data) {
  print(attr(data,"error_df"))
  invisible(data)
}


df %>% 
  check(condition = var1 > 5,
        message = "Value of var1 is 5 or greater") %>% 
  check(condition = is.na(var2),
        message = "Value of var2 is missing") %>% 
  get_errors
#>   check row_nr                       message
#> 1     1      1 Value of var1 is 5 or greater
#> 2     1      5 Value of var1 is 5 or greater
#> 3     2      2      Value of var2 is missing
#> 4     2      5      Value of var2 is missing

Created on 2021-08-17 by the reprex package (v2.0.1)
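The dedicated-environment idea mentioned above could look roughly like the sketch below. It is not a tested package design, just an illustration under some assumptions: the environment name `error_log` and the `log` argument are made up, the function keeps the `addErrorMessage()` name and the filter-then-log usage from the question, and the data is assumed to have an `id` column.

```r
library(dplyr)

# Hypothetical collector: an environment holding the growing error table.
# Environments are modified in place, so neither `<<-` nor attributes are needed.
error_log <- new.env()
error_log$errors <- data.frame(id = numeric(), message = character())

# Appends one row per case in `data` to the log and returns `data` invisibly,
# so the call can sit at the end (or in the middle) of a pipe.
addErrorMessage <- function(data, message, log = error_log) {
  if (nrow(data) > 0) {
    log$errors <- rbind(log$errors,
                        data.frame(id = data$id, message = message))
  }
  invisible(data)
}

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

# Check 1: var1 should be smaller than 5
df %>% filter(var1 >= 5) %>%
  addErrorMessage(message = "Value of var1 is 5 or greater")

# Check 2: var2 should not be missing
df %>% filter(is.na(var2)) %>%
  addErrorMessage(message = "Value of var2 is missing")

error_log$errors
```

With this example data, `error_log$errors` should end up with three rows: ids 5 and 7 for the first check and id 5 for the second. Because the log lives outside the data, it survives whatever transformations happen between checks, but it also has to be reset by hand before a new run.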

Upvotes: 3
