Reputation: 1695
I have a data set like:
ID EMAIL_ID.x ID.y EMAIL_ID.y
60842 [email protected] 60842 [email protected]
90853 a.b.c NA <NA>
90854 b.c.d NA <NA>
83907 [email protected] 02854 <NA>
83908 [email protected] 00952 <NA>
83909 [email protected] 78895 <NA>
In this data, i only want to retain the rows in which both the 3rd and the 4th column are either populated or both not populated.
Hence the result should be like:
ID EMAIL_ID.x ID.y EMAIL_ID.y
60842 [email protected] 60842 [email protected]
90853 a.b.c NA <NA>
90854 b.c.d NA <NA>
In the above, the rows where column 3 had values but column 4 did not, have been deleted.
Please suggest how can this be done in R.
Thankyou!
Upvotes: 0
Views: 121
Reputation: 887951
One option is to check whether there are NA elements (is.na(...)
) in both (&
) columns or (|
) both non-NA elements (by negating !
) and use that logical index to subset the rows.
df1[is.na(df1[,3]) & is.na(df1[,4]) | !is.na(df1[,3]) & !is.na(df1[,4]),]
# ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842 [email protected] 60842 [email protected]
#2 90853 a.b.c NA <NA>
#3 90854 b.c.d NA <NA>
Or another option is to apply rowSums
on the logical matrix (is.na(df1[3:4])
), check if the sum of NA elements are either 0 or 2 using %in%
(if it is 0 - it implies there are no NA elements and 2 means both the columns have NA) and use the logical vector to subset the rows.
df1[rowSums(is.na(df1[3:4])) %in% c(0,2),]
# ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842 [email protected] 60842 [email protected]
#2 90853 a.b.c NA <NA>
#3 90854 b.c.d NA <NA>
Or another option is Reduce
with lapply
(to avoid conversion to matrix - in case the dataset is really big)
df1[Reduce(`+`, lapply(df1[3:4], is.na)) != 1,]
NOTE: If the NA in the OP's dataset is not real NA, we need to convert it to real NA before doing this (assuming the 4th column is character
class)
is.na(df1[,4]) <- df1[,4] == "<NA>"
df1 <- structure(list(ID = c(60842L, 90853L, 90854L, 83907L, 83908L,
83909L), EMAIL_ID.x = c("[email protected]", "a.b.c", "b.c.d", "[email protected]",
"[email protected]", "[email protected]"), JPNUMBER.y = c(60842L, NA,
NA, 2854L, 952L, 78895L), EMAIL_ID.y = c("[email protected]", NA, NA,
NA, NA, NA)), .Names = c("ID", "EMAIL_ID.x", "JPNUMBER.y", "EMAIL_ID.y"
), row.names = c(NA, -6L), class = "data.frame")
Upvotes: 1