Reputation: 1580
Say I write the following code to produce a dataframe:
name <- c("Joe","John","Susie","Mack","Mo","Curly","Jim")
age <- c(1,2,3,NaN,4,5,NaN)
DOB <- c(10000, 12000, 16000, NaN, 18000, 20000, 22000)
DOB <- as.Date(DOB, origin = "1960-01-01")
trt <- c(0, 1, 1, 2, 2, 1, 1)
df <- data.frame(name, age, DOB, trt)
that looks like this:
name age DOB trt
1 Joe 1 1987-05-19 0
2 John 2 1992-11-08 1
3 Susie 3 2003-10-22 1
4 Mack NaN <NA> 2
5 Mo 4 2009-04-13 2
6 Curly 5 2014-10-04 1
7 Jim NaN 2020-03-26 1
How can I remove only the rows where both age and DOB are missing? For example, I'd like a new dataframe (df2) to look like this:
name age DOB trt
1 Joe 1 1987-05-19 0
2 John 2 1992-11-08 1
3 Susie 3 2003-10-22 1
5 Mo 4 2009-04-13 2
6 Curly 5 2014-10-04 1
7 Jim NaN 2020-03-26 1
I've tried the following code, but it deleted too many rows:
df2 <- df[!(is.na(df$age)) & !(is.na(df$DOB)), ]
In SAS, I would just write
WHERE missing(age) ge 1 AND missing(DOB) ge 1
in a DATA step, but obviously R has different syntax.
Thanks in advance!
Upvotes: 1
Views: 4340
Reputation: 589
Maybe this could be easier:
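library(dplyr)
# drop_na(df, age, DOB) removes a row if either column is missing (Jim too),
# so keep the rows where at least one of the two is present instead:
df2 <- filter(df, !(is.na(age) & is.na(DOB)))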
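To see the difference between "missing in any of the columns" and "missing in all of the columns" on the question's data, here is a small base R sketch. It uses complete.cases() as a stand-in for the "any" behaviour (which, as far as I know, is also what tidyr's drop_na() does for the columns it is given); the names any_missing and all_missing are only illustrative.

```r
# Example data from the question
name <- c("Joe", "John", "Susie", "Mack", "Mo", "Curly", "Jim")
age  <- c(1, 2, 3, NaN, 4, 5, NaN)
DOB  <- as.Date(c(10000, 12000, 16000, NaN, 18000, 20000, 22000),
                origin = "1960-01-01")
trt  <- c(0, 1, 1, 2, 2, 1, 1)
df   <- data.frame(name, age, DOB, trt)

cols <- c("age", "DOB")

# TRUE where at least one of the two columns is missing
any_missing <- !complete.cases(df[cols])

# TRUE only where both columns are missing
all_missing <- rowSums(is.na(df[cols])) == length(cols)

as.character(df$name[any_missing])  # rows an "any" filter would drop
as.character(df$name[all_missing])  # rows the question wants dropped

df2 <- df[!all_missing, ]
```

Only the second condition leaves Jim in the result, since his DOB is present.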
Upvotes: 1
Reputation: 70256
If you want to remove the rows where both columns (age and DOB) are NA, you can do for example:
df[!is.na(df$age) | !is.na(df$DOB),]
which keeps every row where at least one of the two columns is not NA, or
df[rowSums(is.na(df[2:3])) < 2L,]
which keeps every row where the number of NAs in columns 2 and 3 is less than 2 (that is, 0 or 1), or, very similarly, by column name:
df[rowSums(is.na(df[c("age", "DOB")])) < 2L,]
And of course there are other options, like what @rawr provided in the comments.
And to better understand the subsetting, check this:
rowSums(is.na(df[2:3]))
#[1] 0 0 0 2 0 0 1
rowSums(is.na(df[2:3])) < 2L
#[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE
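The same rowSums idea can be wrapped up for any set of columns. A minimal sketch, assuming you want to drop a row only when every listed column is NA; drop_if_all_na is a hypothetical helper name, not a function from base R or any package:

```r
# Example data from the question
name <- c("Joe", "John", "Susie", "Mack", "Mo", "Curly", "Jim")
age  <- c(1, 2, 3, NaN, 4, 5, NaN)
DOB  <- as.Date(c(10000, 12000, 16000, NaN, 18000, 20000, 22000),
                origin = "1960-01-01")
trt  <- c(0, 1, 1, 2, 2, 1, 1)
df   <- data.frame(name, age, DOB, trt)

# Hypothetical helper: drop rows in which every column in `cols` is NA
drop_if_all_na <- function(data, cols) {
  n_missing <- rowSums(is.na(data[cols]))   # NAs per row, within `cols` only
  data[n_missing < length(cols), , drop = FALSE]
}

df2 <- drop_if_all_na(df, c("age", "DOB"))
df2
```

Comparing against length(cols) rather than a hard-coded 2L means the same call works if you later check three or more columns.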
Upvotes: 2
Reputation: 1437
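You were pretty close. Your condition keeps only the rows where both columns are present, which also drops Jim. To drop only the rows where both are missing, negate the whole "both missing" test: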
df[!(is.na(df$age) & is.na(df$DOB)), ]
or
df[!is.na(df$age) | !is.na(df$DOB), ]
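The two forms are equivalent by De Morgan's law, while the attempt in the question combines the two negations with & and therefore asks for both columns to be present. A quick check on the question's vectors (the keep_* names are just for illustration):

```r
# Columns from the question
age <- c(1, 2, 3, NaN, 4, 5, NaN)
DOB <- as.Date(c(10000, 12000, 16000, NaN, 18000, 20000, 22000),
               origin = "1960-01-01")

keep_not_both <- !(is.na(age) & is.na(DOB))  # negate the whole conjunction
keep_either   <- !is.na(age) | !is.na(DOB)   # De Morgan equivalent
keep_attempt  <- !is.na(age) & !is.na(DOB)   # the condition from the question

identical(keep_not_both, keep_either)  # the two suggested forms agree
which(!keep_not_both)                  # row numbers dropped by the suggested forms
which(!keep_attempt)                   # row numbers dropped by the original attempt
```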
Upvotes: 1