Reputation: 131
Let say I have a data frame that looks like this :
>df
col1 col2 col3
12 NA 2
21 11 NA
NA 2 NA
3 NA NA
NA NA 4
8 12 5
41 39 9
I want to omit the NA, but the NA in col3 is a piece of precious information, so I don't want to fill it in with any other value without more information. Thus, I want to omit the NA that does not overlap with the NA in col3 only.
So it would look like this :
>df
col1 col2 col3
21 11 NA
NA 2 NA
3 NA NA
8 12 5
41 39 9
The only reason why NA in col2 and col1 is still there because deleting their row will cause the NA in col3 to be deleted also, which is what I want to prevent. Thus, I can tolerate those leftover NA from col1 and col2.
Is there any convenient way for me to do this or any package to tackle this problem? I have tried using filter :
df <- df %>% filter(complete.cases(df[, -3]))
But it became like this instead since there is overlapping NA :
>df
col1 col2 col3
21 11 NA
8 12 5
41 39 9
Any thoughts? Thanks in advance~
Upvotes: 1
Views: 121
Reputation: 887891
We can create a condition with rowSums
df1[!(rowSums(is.na(df1[-3])) > 0 & !is.na(df1[[3]])),]
# col1 col2 col3
#2 21 11 NA
#3 NA 2 NA
#4 3 NA NA
#6 8 12 5
#7 41 39 9
When we do the rowSums(is.na(df1[-3])) > 0
, it is checking for any NA in the first two columns and returns TRUE for those cases
rowSums(is.na(df1[-3])) > 0
#[1] TRUE FALSE TRUE TRUE TRUE FALSE FALSE
But, we want to remove that first row because there is no NA
for that row in 'col3'
Next we check for NA
s in 'col3'
is.na(df1[[3]])
#[1] FALSE TRUE TRUE TRUE FALSE FALSE FALSE
Wherever, there is TRUE, we want to keep it. So, if we do &
, it will return the NA
common in both
(rowSums(is.na(df1[-3])) > 0 & !is.na(df1[[3]]))
#[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE
i.e. 1st and 5th row have NA in both blocks of columns. Negating it changes TRUE-> FALSE
and FALSE-> TRUE
!(rowSums(is.na(df1[-3])) > 0 & !is.na(df1[[3]]))
#[1] FALSE TRUE TRUE TRUE FALSE TRUE TRUE
which are the rows that can have both sets not having any NA or 'col3' having NA and other rows may or not have NA
Or use the same logic in filter
library(dplyr)
df1 %>%
filter(!(rowSums(is.na(.[-3])) > 0 & !is.na(col3)))
df1 <- structure(list(col1 = c(12L, 21L, NA, 3L, NA, 8L, 41L), col2 = c(NA,
11L, 2L, NA, NA, 12L, 39L), col3 = c(2L, NA, NA, NA, 4L, 5L,
9L)), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 1