Reputation: 5250
I have a data frame like this (correct values are just an example):
df <- data.frame(a=c(" ","NO_DATA"," "," ",NA,NA,3),
b=c("NO_DATA","NO_DATA",""," ",NA,2," "),
c=c("NO_DATA","NO_DATA","","",NA,2,3),
d=c("NO_DATA","NO_DATA","","",NA,2,3),
e=c(" ","NO_DATA","","",NA,2,"NO_DATA"))
a b c d e
1 NO_DATA NO_DATA NO_DATA <- I want to Remove this
2 NO_DATA NO_DATA NO_DATA NO_DATA NO_DATA <- I want to Remove this
3 <- I want to Remove this
4 <- I want to Remove this
5 <NA> <NA> <NA> <NA> <NA> <- I want to Remove this
6 <NA> 2 2 2 2 <- Preserve
7 3 3 3 NO_DATA <- Preserve
I need to remove all rows with values: "", " " (or any number of just spaces),NA
,"NO_DATA"; but present in all columns in the same row.
I tried using subset, but the logic seems to be wrong since even this:
subset(df, a != "NO_DATA" & b != "NO_DATA")
results in a wrong result:
a b c d e
3
4
7 3 3 3 NO_DATA
This is the result I want:
a b c d e
6 <NA> 2 2 2 2
7 3 3 3 NO_DATA
I would like to use a regex beacuse possible values could vary
Upvotes: 0
Views: 67
Reputation: 34406
You can subset using:
df[rowSums(!sapply(df, function(x) trimws(x) %in% c("", "NO_DATA") | is.na(x))) > 0, ]
a b c d e
6 <NA> 2 2 2 2
7 3 3 3 NO_DATA
Upvotes: 1
Reputation: 5138
Here is way to remove using regex (i.e., grepl()
) and is.na()
to identify NA
values.
First uses rapply()
to create a TRUE
/FALSE
dataframe to filter with. Then there are a couple ways to filter from there. Note, that if you have a very large dataframe using rowMeans()
will coerce it into a matrix (so that may be slow). Usually not an issue, but an important caveat. Hope this helps!!
# Replace each cell with T/F based on the function (grepl() / is.na())
idx_df <- rapply(df, function(x) grepl("^(\\s*|NO_DATA)$", x) | is.na(x), how = "replace")
# Check if there were no matches in the function (i.e., there is at least one FALSE)
idx <- do.call(pmin, idx_df) == 0
# Check is the rowMean is less than 1 (not all TRUE)
idx2 <- rowMeans(idx_df) < 1
identical(idx, idx2) # same result
[1] TRUE
df[idx, ]
a b c d e
6 <NA> 2 2 2 2
7 3 3 3 NO_DATA
Also, if it makes more sense to you conceptually, you can negate (!
) grepl()
and is.na()
inside rapply()
and go from there. Whatever is easier to read!
Upvotes: 1