Reputation:
I have a very large data frame with hundreds of variables. I want to delete the rows in which NULL appears in consecutive columns. The data frame, df, looks something like this:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
ABC 1 2 3 4 1 2 3 NULL 4 1 AB BC
DEF 2 3 NULL 4 2 3 4 1 2 3 AB BC
GHI NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL AB BC
JKL 3 4 1 2 3 4 1 2 3 4 AB BC
MNO 1 2 3 4 1 NULL 2 3 4 1 AB BC
In this data frame, for example, I want to delete ONLY the row where df$ID == "GHI", so that I get:
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
ABC 1 2 3 4 1 2 3 NULL 4 1 AB BC
DEF 2 3 NULL 4 2 3 4 1 2 3 AB BC
JKL 3 4 1 2 3 4 1 2 3 4 AB BC
MNO 1 2 3 4 1 NULL 2 3 4 1 AB BC
Thanks!
Upvotes: 0
Views: 262
Reputation: 23818
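One can use rowSums to count the occurrences of "NULL" in each row and subset the data frame, retaining only those rows with at most one NULL. Note that this counts every NULL in a row, whether or not the NULLs sit in adjacent columns:
newdf <- df1[rowSums(df1 == "NULL") < 2, ]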
#> newdf
# ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#1 ABC 1 2 3 4 1 2 3 NULL 4 1 AB BC
#2 DEF 2 3 NULL 4 2 3 4 1 2 3 AB BC
#4 JKL 3 4 1 2 3 4 1 2 3 4 AB BC
#5 MNO 1 2 3 4 1 NULL 2 3 4 1 AB BC
data:
df1 <- read.table(text="ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
ABC 1 2 3 4 1 2 3 NULL 4 1 AB BC
DEF 2 3 NULL 4 2 3 4 1 2 3 AB BC
GHI NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL AB BC
JKL 3 4 1 2 3 4 1 2 3 4 AB BC
MNO 1 2 3 4 1 NULL 2 3 4 1 AB BC",
header=TRUE)
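If the data can be re-read, one alternative is to let read.table treat the literal string NULL as a missing value through its na.strings argument, so that the numeric columns stay numeric and is.na() can do the counting. This is only a sketch on the same five example rows; the names df2 and newdf2 are made up for illustration.

```r
# Sketch: read the literal string "NULL" as NA.
# na.strings is a standard read.table argument; df2/newdf2 are illustrative names.
df2 <- read.table(text = "ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
ABC 1 2 3 4 1 2 3 NULL 4 1 AB BC
DEF 2 3 NULL 4 2 3 4 1 2 3 AB BC
GHI NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL AB BC
JKL 3 4 1 2 3 4 1 2 3 4 AB BC
MNO 1 2 3 4 1 NULL 2 3 4 1 AB BC",
header = TRUE, na.strings = "NULL", stringsAsFactors = FALSE)

# Count the missing values per row and keep rows with at most one
newdf2 <- df2[rowSums(is.na(df2)) < 2, ]
newdf2$ID
```

As with the one-liner above, this counts all missing values in a row and does not test whether they are adjacent.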
Upvotes: 1
Reputation: 99391
Seems like a job for rle().
a <- !apply(df[paste0("V", 1:10)] == "NULL", 1, function(x) {
with(rle(x), any(lengths[values] > 1))
})
df[a, ]
# ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
# 1 ABC 1 1 1 1 1 1 1 NULL 1 1 AB BC
# 2 DEF 1 1 NULL 1 1 1 1 1 1 1 AB BC
# 4 JKL 1 1 1 1 1 1 1 1 1 1 AB BC
# 5 MNO 1 1 1 1 1 NULL 1 1 1 1 AB BC
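To see why rle() fits here, it may help to look at what it returns for a single row of TRUE/FALSE flags. The sketch below is only an illustration: the vector x and the helper has_null_run are hypothetical names, not part of the answer's code.

```r
# One row's "is NULL" flags: NULLs in positions 2, 3 and 5
x <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

r <- rle(x)
r$lengths  # run lengths: 1 2 1 1
r$values   # run values:  FALSE TRUE FALSE TRUE

# Hypothetical helper: TRUE if any run of TRUE flags is longer than one
has_null_run <- function(flags) {
  runs <- rle(flags)
  any(runs$lengths[runs$values] > 1)
}

has_null_run(x)                     # TRUE: positions 2 and 3 are adjacent
has_null_run(c(TRUE, FALSE, TRUE))  # FALSE: two NULLs, but not adjacent
```

Indexing lengths by values keeps only the runs of TRUE, so a length above 1 means at least two NULLs in a row.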
Data:
df <- structure(list(ID = c("ABC", "DEF", "GHI", "JKL", "MNO"), V1 = c("1",
"1", "NULL", "1", "1"), V2 = c("1", "1", "NULL", "1", "1"), V3 = c("1",
"NULL", "NULL", "1", "1"), V4 = c("1", "1", "NULL", "1", "1"),
V5 = c("1", "1", "NULL", "1", "1"), V6 = c("1", "1", "NULL",
"1", "NULL"), V7 = c("1", "1", "NULL", "1", "1"), V8 = c("NULL",
"1", "NULL", "1", "1"), V9 = c("1", "1", "NULL", "1", "1"
), V10 = c("1", "1", "NULL", "1", "1"), V11 = c("AB", "AB",
"AB", "AB", "AB"), V12 = c("BC", "BC", "BC", "BC", "BC")), .Names = c("ID",
"V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10",
"V11", "V12"), class = "data.frame", row.names = c(NA, -5L))
Upvotes: 3
Reputation: 9628
You could try:
df[!apply(df, 1, function(x) sum(x == "NULL") > 1), ]
ID V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 ABC 1 1 1 1 1 1 1 NULL 1 1 AB BC
2 DEF 1 1 NULL 1 1 1 1 1 1 1 AB BC
4 JKL 1 1 1 1 1 1 1 1 1 1 AB BC
5 MNO 1 1 1 1 1 NULL 1 1 1 1 AB BC
Upvotes: 0
Reputation: 482
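If you want to drop only the rows with consecutive NA or "NULL" values, you can try:
# idx holds the positions of missing values; diff(idx) == 1 means two of them are adjacent
df[!apply(df, 1, function(x) { idx <- which(is.na(x) | x == "NULL"); any(diff(idx) == 1) }), ]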
Otherwise use the sum method.
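For a very large data frame, the row-wise apply() calls may be slow. A possible vectorized sketch of the consecutive check compares each column of the NULL mask with its right-hand neighbour. It assumes the missing values are stored as the string "NULL" or as NA; the demo data frame and the names cols, vals, is_null, adjacent and keep are illustrative and not taken from the posts above.

```r
# Illustrative data: row "b" has two NULLs that are not adjacent,
# row "c" has two NULLs in adjacent columns
demo <- data.frame(
  ID = c("a", "b", "c"),
  V1 = c("1", "NULL", "1"),
  V2 = c("2", "2", "NULL"),
  V3 = c("3", "NULL", "NULL"),
  stringsAsFactors = FALSE
)

cols <- paste0("V", 1:3)                  # columns to check (assumption)
vals <- as.matrix(demo[cols])
is_null <- is.na(vals) | vals == "NULL"   # logical mask, rows x columns

# TRUE where a NULL is immediately followed by another NULL
adjacent <- is_null[, -ncol(is_null), drop = FALSE] &
  is_null[, -1, drop = FALSE]

keep <- rowSums(adjacent) == 0
demo[keep, "ID"]                          # "a" "b"
```

Row "b" survives here but would be dropped by a plain count of NULLs per row, which is the difference between the two approaches. For the data in the question, cols would be paste0("V", 1:10).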
Upvotes: 1