gh0strider18
gh0strider18

Reputation: 1140

drop levels of factor for which there is one missing value for one column r

I would like to drop any occurrence of a factor level for which one row contains a missing value

Example:

ID var1 var2
1  1    2
1  NA   3
2  1    2
2  2    4

So, in this hypothetical, what would be left would be:

ID var1 var2
2  1    2
2  2    4

Upvotes: 3

Views: 162

Answers (4)

Arun
Arun

Reputation: 118779

In the current development version of data.table, there's a new implementation of na.omit for data.tables, which takes a cols =and invert = arguments.

The cols = allows to specify the columns on which to look for NAs. And invert = TRUE returns the NA rows instead, instead of omitting them.

You can install the devel version by following these instructions. Or you can wait for 1.9.6 on CRAN at some point. Using that, we can do:

require(data.table) ## 1.9.5+
setkey(setDT(df), ID)
df[!na.omit(df, invert = TRUE)]
#    ID var1 var2
# 1:  2    1    2
# 2:  2    2    4

How this works:

  • setDT converts data.frame to data.table by reference.

  • setkey sorts the data.table by the columns provided and marks those columns as key columns so that we can perform a join.

  • na.omit(df, invert = TRUE) gives just those rows that have NA anywhere.

  • X[!Y] does an anit-join by joining on the key column ID, and returns all the rows that don't match ID = 1 (from Y). Check this post to read in detail about data.table's joins.

HTH

Upvotes: 2

talat
talat

Reputation: 70256

Here's one more option in base R. It will check all columns for NAs.

df[!df$ID %in% df$ID[rowSums(is.na(df)) > 0],]
#  ID var1 var2
#3  2    1    2
#4  2    2    4

If you only want to check in column "var1" you can do:

df[!with(df, ID %in% ID[is.na(var1)]),]
#  ID var1 var2
#3  2    1    2
#4  2    2    4

Upvotes: 3

akrun
akrun

Reputation: 886948

Assuming that NAs would occur in both var columns,

 df[with(df, !ave(!!rowSums(is.na(df[,-1])), ID, FUN=any)),]
 #   ID var1 var2
 #3  2    1    2
 #4  2    2    4

Or if it is only specific to var1

 df[with(df, !ave(is.na(var1), ID, FUN=any)),]
 #  ID var1 var2
 #3  2    1    2
 #4  2    2    4

Or using dplyr

 library(dplyr)
 df %>% 
     group_by(ID) %>%
     filter(all(!is.na(var1)))
 #   ID var1 var2
 #1  2    1    2
 #2  2    2    4

data

 df <- structure(list(ID = c(1L, 1L, 2L, 2L), var1 = c(1L, NA, 1L, 2L
 ), var2 = c(2L, 3L, 2L, 4L)), .Names = c("ID", "var1", "var2"
 ), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 3

David Arenburg
David Arenburg

Reputation: 92282

Hers's possible data.table solution (sorry @rawr)

library(data.table)
setDT(df)[, if (all(!is.na(.SD))) .SD, ID]
#    ID var1 var2
# 1:  2    1    2
# 2:  2    2    4

If you only want to check var1 then

df[, if (all(!is.na(var1))) .SD, ID]
#    ID var1 var2
# 1:  2    1    2
# 2:  2    2    4

Upvotes: 4

Related Questions