lokheart

Reputation: 24665

counting vectors with NA included

By accident, I found that R counts the elements of a vector containing NAs in an interesting way:

> temp <- c(NA,NA,NA,1) # 4 items
> length(temp[temp>1])
[1] 3

> temp <- c(NA,NA,1) # 3 items
> length(temp[temp>1])
[1] 2

At first I assumed R would collapse all the NAs into a single NA, but this is not the case.

Can anyone explain? Thanks.

Upvotes: 2

Views: 561

Answers (3)

IRTFM

Reputation: 263332

You were expecting only TRUE's and FALSE's (and the results to only be FALSE) but a logical vector can also have NA's. If you were hoping for a length zero result, then you had at least three other choices:

> temp <- c(NA,NA,NA,1) # 4 items
>  length(temp[ which(temp>1) ] )
[1] 0

> temp <- c(NA,NA,NA,1) # 4 items
>  length(subset( temp, temp>1) )
[1] 0

> temp <- c(NA,NA,NA,1) # 4 items
>  length( temp[ !is.na(temp) & temp>1 ] )
[1] 0

You will find the last form in a lot of the internal code of well established functions. I happen to think the first version is more economical and easier to read, but the R Core seems to disagree. I have several times been advised on R help not to use which() around logical expressions. I remain unconvinced. It is correct that one should not combine it with negative indexing.

EDIT: The reason not to use the construct "minus which" (negative indexing with which) is that when none of the items pass the which-test, and you would therefore expect all of them to be returned, it instead returns an empty vector:

 temp <- c(1,2,3,4,NA)
 temp[!temp > 5]
#[1]  1  2  3  4 NA             As expected
 temp[-which(temp > 5)]
#numeric(0)                 Not as expected
 temp[!temp > 5 & !is.na(temp)]
#[1] 1 2 3 4           A correct way to handle negation
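
To spell out why "minus which" comes back empty: which() returns integer(0) when nothing matches, negating integer(0) still gives integer(0), and indexing with a zero-length vector selects nothing. A minimal sketch follows; the setdiff() line is my own suggested alternative, not something from the answer above, and note that it keeps the NA element.

```r
temp <- c(1, 2, 3, 4, NA)

hits <- which(temp > 5)   # integer(0): nothing exceeds 5, and the NA is dropped
neg  <- -hits             # still integer(0), so temp[neg] selects nothing

# Assumed alternative (not from the answer): remove the matching
# positions from the full set of positions instead of negating them.
keep <- setdiff(seq_along(temp), hits)
kept <- temp[keep]        # all five elements, including the NA
```

If the NA should also be dropped, the `!is.na(temp)` form shown above is still the one to use.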

I admit that the notion that NA's should select NA elements seems a bit odd, but it is rooted in the history of S and therefore R. There is a section in ?"[" about "NA's in indexing". The rationale is that each NA as an index should return an unknown result, i.e. another NA.
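
A small sketch of that indexing rule, using a made-up vector `x` purely for illustration: an NA in a logical or numeric index produces an NA in the result, while which() drops the unknown positions first.

```r
x <- c(10, 20, 30)   # hypothetical example data

by_logical_na <- x[c(TRUE, NA, FALSE)]         # 10 NA
by_numeric_na <- x[c(1, NA, 3)]                # 10 NA 30
by_bare_na    <- x[NA]                         # NA NA NA (logical NA is recycled)
by_which      <- x[which(c(TRUE, NA, FALSE))]  # 10
```

The `x[NA]` case is worth remembering: a bare NA is logical, so it is recycled to the length of `x` rather than treated as a single index.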

Upvotes: 3

Patrick Burns

Reputation: 897

You can use 'sum':

> tmp <- c(NA, NA, NA, 3)
> sum(tmp > 1)
[1] NA
> sum(tmp > 1, na.rm=TRUE)
[1] 1

A bit of explanation: 'sum' expects numbers but 'tmp > 1' is logical. So it is automatically coerced to be numeric: TRUE => 1; FALSE => 0; NA => NA.
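
The coercion can be made visible step by step. This is only a sketch; the variable names are mine, and as far as I know sum() on a logical vector returns an integer, which is why as.integer() is used to show the intermediate values.

```r
tmp <- c(NA, NA, NA, 3)

flags     <- tmp > 1                   # NA NA NA TRUE
as_counts <- as.integer(flags)         # NA NA NA 1
n_true    <- sum(flags, na.rm = TRUE)  # 1: the NAs are dropped before summing
n_unknown <- sum(is.na(flags))         # 3: how many comparisons were unknown
```

Counting the NAs separately with is.na() makes it clear how many elements were skipped by na.rm = TRUE.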

I don't think there is anything precisely like this in 'The R Inferno' but this is definitely the sort of question that it is aimed at. http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Upvotes: 0

Iterator

Reputation: 20560

If you break down each command and look at the output, it's more enlightening:

> tmp = c(NA, NA, 1)
> tmp > 1
[1]    NA    NA FALSE
> tmp[tmp > 1]
[1] NA NA

So, when we next perform length(tmp[tmp > 1]), it's as if we're executing length(c(NA, NA)). It is fine to have a vector full of NAs: it still has a fixed length, as if we'd created it via NA * vector(length = 2), which is different from NA * vector(length = 3).
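
Put another way, the length you get back is the number of NAs in the logical index plus the number of TRUEs. A quick check of that bookkeeping, with variable names chosen only for this sketch:

```r
tmp <- c(NA, NA, 1)

mask       <- tmp > 1                  # NA NA FALSE
n_selected <- length(tmp[mask])        # 2
n_na       <- sum(is.na(mask))         # 2 unknown comparisons
n_true     <- sum(mask, na.rm = TRUE)  # 0 genuine matches
```

Here `n_selected` equals `n_na + n_true`, which is why adding another NA to the vector raises the count by one.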

Upvotes: 2
