Reputation: 1
To subset rows of a data frame, putting the condition in the first position of `[ , ]` seems to be the standard method, and wrapping that condition in `which()` seems redundant. However, when the data contain missing values, why does the first method return extra `NA` rows while the `which()` method does not, as in the following example?
df <- data.frame(var1=c(1,2,3,NA,NA), var2=c(4,0,5,2,3), var3=c(1,2,3,0,6))
testvar1<-df[df$var1==3,]
testvar1.which<-df[which(df$var1==3),]
testvar1
| | var1 | var2 | var3 |
|---|---|---|---|
| 3 | 3 | 5 | 3 |
| NA | NA | NA | NA |
| NA.1 | NA | NA | NA |
testvar1.which
| | var1 | var2 | var3 |
|---|---|---|---|
| 3 | 3 | 5 | 3 |
Upvotes: 0
Views: 79
Reputation: 173793
The simple answer is that `which` drops `NA` values, because it returns only the positions that are `TRUE`, whereas a straightforward logical test returns a vector of the same length as the input with `NA` preserved. Compare:
df$var1 == 3
#> [1] FALSE FALSE TRUE NA NA
which(df$var1 == 3)
#> [1] 3
If you subset the data frame with the first result, the first two rows are dropped as expected (because they correspond to `FALSE`) and the third row is kept because it is `TRUE`, which is also expected. The last two rows are where the confusion comes in. If you subset a data frame with an `NA` in the row index, you don't get an empty result for that position, you get a row filled with `NA`. The two rows at the bottom (`NA` and `NA.1`) are those `NA` rows, one for each `NA` in the logical index.
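As a rough sketch of the same behaviour, the snippet below first shows an `NA` index on a plain vector and then lists a few alternatives to `which()` that also treat `NA` as "no match". The names `x`, `by_which`, `by_isna`, `by_in` and `by_subset` are made up for illustration; the data frame is the one from the question.

```r
# Same data as in the question.
df <- data.frame(var1 = c(1, 2, 3, NA, NA),
                 var2 = c(4, 0, 5, 2, 3),
                 var3 = c(1, 2, 3, 0, 6))

# An NA in a logical index yields an NA element (an all-NA row for a data frame).
x <- c(10, 20, 30)
x[c(TRUE, NA, FALSE)]
#> [1] 10 NA

# which() keeps only the positions that are TRUE.
by_which <- df[which(df$var1 == 3), ]

# Alternatives that also treat NA as "no match":
by_isna   <- df[!is.na(df$var1) & df$var1 == 3, ]  # explicit NA check
by_in     <- df[df$var1 %in% 3, ]                  # %in% returns FALSE, not NA, for NA
by_subset <- subset(df, var1 == 3)                 # subset() treats NA conditions as FALSE

by_which
#>   var1 var2 var3
#> 3    3    5    3
```

Which form to use is mostly a matter of taste. `which()` and `subset()` are the shortest, while the explicit `!is.na()` check makes the handling of missing values visible to anyone reading the code.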
Upvotes: 2