thogs

Reputation: 1

Subsetting rows from a data frame in R using []

To subset rows from a data frame, putting the condition in the first position of [ , ] seems to be the standard method, and wrapping that condition in which() seems redundant. However, when the data contain missing values, why does the first method return unexpected rows while the which() method works, as in the following example?

df <- data.frame(var1 = c(1, 2, 3, NA, NA), var2 = c(4, 0, 5, 2, 3), var3 = c(1, 2, 3, 0, 6))
testvar1 <- df[df$var1 == 3, ]
testvar1.which <- df[which(df$var1 == 3), ]

testvar1

     var1 var2 var3
3       3    5    3
NA     NA   NA   NA
NA.1   NA   NA   NA

testvar1.which

  var1 var2 var3
3    3    5    3

Upvotes: 0

Views: 79

Answers (1)

Allan Cameron

Reputation: 173793

The simple answer is that which() returns only the indices of the TRUE elements, so NA values are dropped (they are treated as if FALSE), whereas a straightforward logical test returns a vector of the same length as the input with NA preserved. Compare:

df$var1 == 3
#> [1] FALSE FALSE  TRUE    NA    NA

which(df$var1 == 3)
#> [1] 3

If you subset the data frame with the first result, the first two rows are dropped as expected (because they correspond to FALSE) and the third row is kept because it is TRUE, which is also expected. The last two rows are where the confusion comes in. An NA in a logical row index does not drop the row; it returns a row filled with NA instead. The two rows at the bottom of testvar1 (labelled NA and NA.1) are these NA rows, one for each NA in the index.
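
As a rough illustration of the behaviour described above, here is a minimal sketch using the df from the question. The vector x and the names keep_explicit, keep_in and keep_subset are made up for this example; the three options are common ways to get the same result as which(), not the only ones.

```r
# Same data frame as in the question
df <- data.frame(var1 = c(1, 2, 3, NA, NA),
                 var2 = c(4, 0, 5, 2, 3),
                 var3 = c(1, 2, 3, 0, 6))

# An NA in a logical index yields an NA element instead of dropping it
x <- c(10, 20, 30)
x[c(TRUE, NA, FALSE)]
#> [1] 10 NA

# Option 1: rule out NA explicitly (FALSE & NA evaluates to FALSE)
keep_explicit <- df[!is.na(df$var1) & df$var1 == 3, ]

# Option 2: %in% returns FALSE, never NA, for missing values
keep_in <- df[df$var1 %in% 3, ]

# Option 3: subset() treats NA in the condition as FALSE
keep_subset <- subset(df, var1 == 3)
```

Each of the three options should return only row 3, matching testvar1.which.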

Upvotes: 2
