MGP

Reputation: 2551

R validate: get the rows that violate rules

I'm using the validate package to validate a data frame. Some of my rules check data types; others express constraints that the data must satisfy. My problem is that the data type check is evaluated array-wise (one result for the whole column) rather than record-wise. So when I try to get the rows that violate the rules using violating, I get the error message

"Error in violating(virg, rules) : Not all rules have record-wise output".

I made a small example illustrating the problem:

library(validate)
library(dplyr)

virg <- filter(iris, Species == "virginica")
virg$Sepal.Length[2] <- "hello"
virg$Sepal.Length[3] <- -3

rules <- validator(
  Sepal.Length > 0
  , is.numeric(Sepal.Length)
)

cf <- confront(virg, rules)
summary(cf)
violating(virg, rules)

I would like to get rows 2 and 3 as output, ideally with information about which rule was violated. Is there an easy way to force record-wise output when checking data types? How else can I check for violations?

Upvotes: 2

Views: 747

Answers (1)

gregor-fausto

Reputation: 660

I came here with this exact same question. After looking at the paper and blog post at the end of this answer, I came up with two options.

The first is to pass violating only the rules that have record-wise output, here rules[1]. Note that this leaves the is.numeric check out of the result.

The second is to use the values function and extract the component of the output that corresponds to the row-wise evaluation of rules. In this case, it's the first element of the list, hence values(cf)[[1]]. Then select any row that fails at least one rule.

library(validate)
library(dplyr)

virg <- filter(iris, Species == "virginica")
virg$Sepal.Length[2] <- "hello"
virg$Sepal.Length[3] <- -3

rules <- validator(
  Sepal.Length > 0
  , is.numeric(Sepal.Length)
)

cf <- confront(virg, rules)
summary(cf)

# option 1
violating(virg, rules[1])

# option 2
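# values(cf) is a list here; its first element holds the record-wise results
out <- values(cf)[[1]]
# TRUE for rows that pass every record-wise rule (NA results are ignored)
all_pass <- apply(out, 1, all, na.rm = TRUE)
virg[!all_pass, ]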
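
The question also asks which rule each row violated. The following is a minimal base-R sketch of that idea, not something taken from the validate package: it assumes a logical matrix shaped like values(cf)[[1]] (one row per record, one column per rule) and builds a stand-in for it by hand so the snippet runs on its own. The column names is_number and positive, and the objects num, out, fails and report, are made up for illustration. Assigning "hello" coerces the whole Sepal.Length column to character, so the sketch parses it back with as.numeric to get a record-wise type check.

```r
# Rebuild the example data with base R only.
virg <- iris[iris$Species == "virginica", ]
rownames(virg) <- NULL
virg$Sepal.Length[2] <- "hello"   # coerces the whole column to character
virg$Sepal.Length[3] <- -3

# Record-wise checks computed by hand (a stand-in for values(cf)[[1]]).
num <- suppressWarnings(as.numeric(virg$Sepal.Length))
out <- cbind(
  is_number = !is.na(num),           # FALSE where the value cannot be parsed
  positive  = !is.na(num) & num > 0  # FALSE for non-numbers and values <= 0
)

# One line per (row, rule) failure.
fails <- which(!out, arr.ind = TRUE)
report <- data.frame(
  row  = fails[, "row"],
  rule = colnames(out)[fails[, "col"]]
)
report <- report[order(report$row), ]
print(report)
```

With a real confrontation object, the same which(..., arr.ind = TRUE) step could be applied to values(cf)[[1]] instead of the hand-built matrix, provided that element is a logical matrix with rule names as column names.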

  1. van der Loo, M. P. J., and E. de Jonge. 2021. Data Validation Infrastructure for R. Journal of Statistical Software 97.
  2. Comments by one of the package authors on this blog post: https://www.markvanderloo.eu/yaRb/2016/03/25/easy-data-validation-with-the-validate-package/

Upvotes: 2
