Reputation: 1875
I have a data.frame df and I would like to run some checks on the data. If there's an error (e.g. missing or implausible values), I would like to build a list containing the id of the case and the type of error.
# Define an empty data.frame
errors <- data.frame(id = numeric(),
                     message = character())
# Function that stacks all the errors
addErrorMessage <- function(data, message){
  errors <- rbind(errors, ) # <= what to do here?
}
df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))
########### List of checks ################
# Check 1: var1 should be smaller than 5
df %>% filter(var1 >= 5) %>%
  addErrorMessage(message = "Value of var1 is 5 or greater")
# Check 2: var2 should not be missing
df %>% filter(is.na(var2)) %>%
  addErrorMessage(message = "Value of var2 is missing")
My question is: How can I define a function addErrorMessage() that I can use directly in a tidyverse workflow? I want to avoid saving the offending cases to a temporary data.frame for each check and then stacking that data.frame onto the errors data.frame using rbind().
Upvotes: 3
Views: 99
Reputation: 18581
Your actual problem can probably be solved using the {pointblank} package, which contains many functions for running this and similar tests.
If you are more interested in writing such validation functions yourself, see a very rough draft below.
df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))
library(pointblank)
df %>%
  col_vals_lt(vars(var1),
              value = 5) %>%
  col_vals_not_null(vars(var2))
#> Error: Exceedance of failed test units where values in `var1` should have been < `5`.
#> The `col_vals_lt()` validation failed beyond the absolute threshold level (1).
#> * failure level (2) >= failure threshold (1)
Created on 2021-08-17 by the reprex package (v2.0.1)
{pointblank} can also generate data validation reports:
agent <-
  create_agent(
    tbl = df,
    tbl_name = "My data",
    label = "Checking column values",
    actions = action_levels(stop_at = 1)
  ) %>%
  col_vals_lt(vars(var1),
              value = 5) %>%
  col_vals_not_null(vars(var2)) %>%
  interrogate()
agent
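If what you want is the list of offending ids rather than a report, the failing rows can be pulled back out of the interrogated agent. The following is a minimal sketch, assuming get_data_extracts() returns a list of data frames named by validation step index; the errors data.frame and the way it is stacked are purely illustrative and not part of {pointblank}.

```r
library(pointblank)

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

agent <-
  create_agent(tbl = df, actions = action_levels(stop_at = 1)) %>%
  col_vals_lt(vars(var1), value = 5) %>%
  col_vals_not_null(vars(var2)) %>%
  interrogate()

# One data frame of failing rows per validation step that had failures
# (assumption: the list is named by step index).
extracts <- get_data_extracts(agent)

# Illustrative only: stack the failing ids together with their step.
errors <- do.call(rbind, lapply(names(extracts), function(step) {
  data.frame(step = step, id = extracts[[step]]$id)
}))
errors
```

This keeps the checks declarative and leaves the bookkeeping to the agent; a message per step would have to be attached separately, for example by looking it up from the step index.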
If you are more interested in writing this kind of function yourself, below is a very rough draft. It uses the attributes of the underlying data.frame, which is not a great solution: depending on the functions you use between checks, the attributes might get lost. In a package we could use a dedicated environment to capture errors, in which case we wouldn't need the attributes.
library(dplyr)
df <- data.frame(id = 1:7,
                 var1 = c(10, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", NA, "B", "C", NA, "D", "A"))
# Evaluate `condition` on `data` and record failing rows in the "error_df" attribute
check <- function(data, condition, message){
  exp <- rlang::enexpr(condition)
  test <- transmute(data, new = eval(exp))$new
  if (any(test)) {
    err_df <- attr(data, "error_df")
    if (is.null(err_df)) {
      # first failing check: create the error table
      attr(data, "error_df") <- data.frame(check = 1L,
                                           row_nr = which(test),
                                           message = message)
    } else {
      # later failing checks: append under the next check number
      attr(data, "error_df") <- rbind(err_df,
                                      data.frame(check = max(err_df$check) + 1L,
                                                 row_nr = which(test),
                                                 message = message)
      )
    }
  }
  data
}
# Print the collected errors and return the data invisibly
get_errors <- function(data) {
  print(attr(data, "error_df"))
  invisible(data)
}
df %>%
  check(condition = var1 > 5,
        message = "Value of var1 is 5 or greater") %>%
  check(condition = is.na(var2),
        message = "Value of var2 is missing") %>%
  get_errors
#>   check row_nr                       message
#> 1     1      1 Value of var1 is 5 or greater
#> 2     1      5 Value of var1 is 5 or greater
#> 3     2      2      Value of var2 is missing
#> 4     2      5      Value of var2 is missing
Created on 2021-08-17 by the reprex package (v2.0.1)
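The dedicated-environment idea mentioned above could look roughly like this. It is only a sketch, not a tested package design: error_log and add_error_message() are hypothetical names, and the function assumes the piped data still has an id column. It keeps the filter() style from the question and uses var1 >= 5 so that the condition matches the message.

```r
library(dplyr)

# Hypothetical collector: an environment holds the running error table,
# so a function can update it without relying on data.frame attributes.
error_log <- new.env()
error_log$errors <- data.frame(id = integer(), message = character())

# Append one row per offending case; return the input invisibly so the
# call can sit at the end of a pipe.
add_error_message <- function(data, message) {
  if (nrow(data) > 0) {
    new_rows <- data.frame(id = data$id, message = message)
    error_log$errors <- rbind(error_log$errors, new_rows)
  }
  invisible(data)
}

df <- data.frame(id = 1:7,
                 var1 = c(1, 2, 3, 3, 9, 4, 5),
                 var2 = c("A", "A", "B", "C", NA, "D", "A"))

# Check 1: var1 should be smaller than 5
df %>% filter(var1 >= 5) %>%
  add_error_message("Value of var1 is 5 or greater")

# Check 2: var2 should not be missing
df %>% filter(is.na(var2)) %>%
  add_error_message("Value of var2 is missing")

error_log$errors
```

With this data, error_log$errors ends up with three rows (ids 5 and 7 for the first check, id 5 for the second). Because the log lives outside the data, it survives whatever happens to the data.frame between checks, but it also has to be reset by hand before a new run.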
Upvotes: 3