eclairs
eclairs

Reputation: 1695

deleting rows in R containing one blank column

I have a data set like:

  ID               EMAIL_ID.x        ID.y     EMAIL_ID.y
 60842               [email protected]      60842        [email protected]
 90853                a.b.c         NA            <NA>
 90854                b.c.d         NA            <NA>
 83907               [email protected]      02854         <NA>
 83908             [email protected]    00952         <NA>
 83909             [email protected]     78895         <NA>

In this data, i only want to retain the rows in which both the 3rd and the 4th column are either populated or both not populated.

Hence the result should be like:

   ID               EMAIL_ID.x       ID.y     EMAIL_ID.y
 60842               [email protected]      60842        [email protected]
 90853                a.b.c         NA            <NA>
 90854                b.c.d         NA            <NA>

In the above, the rows where column 3 had values but column 4 did not, have been deleted.

Please suggest how can this be done in R.

Thankyou!

Upvotes: 0

Views: 121

Answers (1)

akrun
akrun

Reputation: 887951

One option is to check whether there are NA elements (is.na(...)) in both (&) columns or (|) both non-NA elements (by negating !) and use that logical index to subset the rows.

df1[is.na(df1[,3]) & is.na(df1[,4]) | !is.na(df1[,3]) & !is.na(df1[,4]),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  [email protected]      60842  [email protected]
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is to apply rowSums on the logical matrix (is.na(df1[3:4])), check if the sum of NA elements are either 0 or 2 using %in% (if it is 0 - it implies there are no NA elements and 2 means both the columns have NA) and use the logical vector to subset the rows.

df1[rowSums(is.na(df1[3:4])) %in% c(0,2),]
#    ID EMAIL_ID.x JPNUMBER.y EMAIL_ID.y
#1 60842  [email protected]      60842  [email protected]
#2 90853      a.b.c         NA       <NA>
#3 90854      b.c.d         NA       <NA>

Or another option is Reduce with lapply (to avoid conversion to matrix - in case the dataset is really big)

df1[Reduce(`+`, lapply(df1[3:4], is.na)) != 1,]

NOTE: If the NA in the OP's dataset is not real NA, we need to convert it to real NA before doing this (assuming the 4th column is character class)

is.na(df1[,4]) <- df1[,4] == "<NA>"

data

df1 <- structure(list(ID = c(60842L, 90853L, 90854L, 83907L, 83908L, 
83909L), EMAIL_ID.x = c("[email protected]", "a.b.c", "b.c.d", "[email protected]", 
"[email protected]", "[email protected]"), JPNUMBER.y = c(60842L, NA, 
NA, 2854L, 952L, 78895L), EMAIL_ID.y = c("[email protected]", NA, NA, 
NA, NA, NA)), .Names = c("ID", "EMAIL_ID.x", "JPNUMBER.y", "EMAIL_ID.y"
), row.names = c(NA, -6L), class = "data.frame")

Upvotes: 1

Related Questions