Lutfi
Lutfi

Reputation: 131

Only omit NA that does not overlap with other NA in a certain column

Let say I have a data frame that looks like this :

>df
col1   col2   col3

 12    NA      2 
 21    11     NA
 NA     2     NA
 3     NA     NA
 NA    NA      4
 8     12      5
 41    39      9

I want to omit the NA, but the NA in col3 is a piece of precious information, so I don't want to fill it in with any other value without more information. Thus, I want to omit the NA that does not overlap with the NA in col3 only.

So it would look like this :

>df
col1   col2   col3

 21    11     NA
 NA     2     NA
 3     NA     NA
 8     12      5
 41    39      9

The only reason why NA in col2 and col1 is still there because deleting their row will cause the NA in col3 to be deleted also, which is what I want to prevent. Thus, I can tolerate those leftover NA from col1 and col2.

Is there any convenient way for me to do this or any package to tackle this problem? I have tried using filter :

df <- df %>% filter(complete.cases(df[, -3]))

But it became like this instead since there is overlapping NA :

 >df
 col1   col2   col3

 21    11     NA
 8     12      5
 41    39      9

Any thoughts? Thanks in advance~

Upvotes: 1

Views: 121

Answers (1)

akrun
akrun

Reputation: 887891

We can create a condition with rowSums

df1[!(rowSums(is.na(df1[-3])) > 0 & !is.na(df1[[3]])),]
#  col1 col2 col3
#2   21   11   NA
#3   NA    2   NA
#4    3   NA   NA
#6    8   12    5
#7   41   39    9

When we do the rowSums(is.na(df1[-3])) > 0, it is checking for any NA in the first two columns and returns TRUE for those cases

rowSums(is.na(df1[-3])) > 0
#[1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE

But, we want to remove that first row because there is no NA for that row in 'col3'

Next we check for NAs in 'col3'

is.na(df1[[3]])
#[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

Wherever, there is TRUE, we want to keep it. So, if we do &, it will return the NA common in both

(rowSums(is.na(df1[-3])) > 0 & !is.na(df1[[3]]))
#[1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

i.e. 1st and 5th row have NA in both blocks of columns. Negating it changes TRUE-> FALSE and FALSE-> TRUE

!(rowSums(is.na(df1[-3])) > 0 & !is.na(df1[[3]]))
#[1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

which are the rows that can have both sets not having any NA or 'col3' having NA and other rows may or not have NA


Or use the same logic in filter

library(dplyr)
df1  %>%
     filter(!(rowSums(is.na(.[-3])) > 0 & !is.na(col3)))

data

df1 <- structure(list(col1 = c(12L, 21L, NA, 3L, NA, 8L, 41L), col2 = c(NA, 
11L, 2L, NA, NA, 12L, 39L), col3 = c(2L, NA, NA, NA, 4L, 5L, 
9L)), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 1

Related Questions