Emeeus
Emeeus

Reputation: 5250

Remove rows in data.frame when the entire row values match a regex, or match a group of values

I have a data frame like this (correct values ​​are just an example):

df <- data.frame(a=c(" ","NO_DATA","   "," ",NA,NA,3),
                 b=c("NO_DATA","NO_DATA",""," ",NA,2," "),
                 c=c("NO_DATA","NO_DATA","","",NA,2,3),
                 d=c("NO_DATA","NO_DATA","","",NA,2,3),
                 e=c("  ","NO_DATA","","",NA,2,"NO_DATA"))

        a       b       c       d       e
1         NO_DATA NO_DATA NO_DATA          <- I want to Remove this
2 NO_DATA NO_DATA NO_DATA NO_DATA NO_DATA  <- I want to Remove this
3                                          <- I want to Remove this
4                                          <- I want to Remove this
5    <NA>    <NA>    <NA>    <NA>    <NA>  <- I want to Remove this
6    <NA>       2       2       2       2  <- Preserve
7       3               3       3 NO_DATA  <- Preserve

I need to remove all rows with values: "", " " (or any number of just spaces),NA,"NO_DATA"; but present in all columns in the same row.

I tried using subset, but the logic seems to be wrong since even this:

subset(df, a != "NO_DATA" & b != "NO_DATA")

results in a wrong result:

    a b c d       e
3                  
4                  
7   3   3 3 NO_DATA

This is the result I want:

     a       b       c       d       e   
6    <NA>       2       2       2       2  
7       3               3       3 NO_DATA 

I would like to use a regex beacuse possible values could vary

Upvotes: 0

Views: 67

Answers (2)

lroha
lroha

Reputation: 34406

You can subset using:

df[rowSums(!sapply(df, function(x) trimws(x) %in% c("", "NO_DATA") | is.na(x))) > 0, ]

     a b c d       e
6 <NA> 2 2 2       2
7    3   3 3 NO_DATA

Upvotes: 1

Andrew
Andrew

Reputation: 5138

Here is way to remove using regex (i.e., grepl()) and is.na() to identify NA values.

First uses rapply() to create a TRUE/FALSE dataframe to filter with. Then there are a couple ways to filter from there. Note, that if you have a very large dataframe using rowMeans() will coerce it into a matrix (so that may be slow). Usually not an issue, but an important caveat. Hope this helps!!

# Replace each cell with T/F based on the function (grepl() / is.na())
idx_df <- rapply(df, function(x) grepl("^(\\s*|NO_DATA)$", x) | is.na(x), how = "replace")

# Check if there were no matches in the function (i.e., there is at least one FALSE)
idx <- do.call(pmin, idx_df) == 0

# Check is the rowMean is less than 1 (not all TRUE)
idx2 <- rowMeans(idx_df) < 1

identical(idx, idx2) # same result
[1] TRUE

df[idx, ]
     a b c d       e
6 <NA> 2 2 2       2
7    3   3 3 NO_DATA

Also, if it makes more sense to you conceptually, you can negate (!) grepl() and is.na() inside rapply() and go from there. Whatever is easier to read!

Upvotes: 1

Related Questions