Reputation: 13
I am currently taking the Getting and Cleaning Data Course on Coursera :D
The first quiz contained this question: How many properties are worth more than $1,000,000? According to the code book, property values are recorded in the column VAL, and the code 24 marks properties worth $1,000,000 or more.
My first attempt to solve this question looked like this:
length(data$VAL[data$VAL=="24"])
However, this didn't give me the right answer. By chance (and after some nervous breakdowns) I tried this, and it worked:
length(data$VAL[!is.na(data$VAL) & data$VAL=="24"])
Now I had the right solution, but I don't really understand why it works. In my first attempt above, it seems all the NAs were included too, although I filtered on data$VAL=="24".
Can anybody please explain why my first guess didn't work but the second did? It seems counterintuitive to me. :/
Best wishes and thanks for your thoughts, Dominic
Upvotes: 1
Reputation: 25405
Sample data:
data = data.frame(VAL=c('24','24','24',NA,NA))
Let's first look at
data$VAL=="24"
which returns
[1] TRUE TRUE TRUE NA NA
So when you do
data$VAL[data$VAL=="24"]
we tell R to take from data$VAL all elements where data$VAL=="24" is TRUE, and to leave out those where it is FALSE (try, for example, c(1,2,3)[c(TRUE,FALSE,TRUE)]). For the fourth and fifth elements, the index is neither TRUE nor FALSE but NA, and R returns NA for each such position. So you get
[1] 24 24 24 <NA> <NA>
which indeed has a length of 5. When you do
data$VAL[!is.na(data$VAL) & data$VAL=="24"]
the condition is FALSE wherever data$VAL is NA (because FALSE & NA evaluates to FALSE), so those elements are left out, and thus the length is 3.
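To see the two index vectors side by side, here is a minimal sketch. It rebuilds the sample data from above; the names idx_raw and idx_safe are made up for illustration, and stringsAsFactors = FALSE is only there so VAL is a character column regardless of R version.

```r
# Toy data mirroring the sample above (VAL kept as character)
data <- data.frame(VAL = c("24", "24", "24", NA, NA), stringsAsFactors = FALSE)

# Comparing with NA yields NA, so the last two index entries are NA
idx_raw <- data$VAL == "24"

# FALSE & NA evaluates to FALSE, so the NA positions are switched off
idx_safe <- !is.na(data$VAL) & data$VAL == "24"

print(idx_raw)   # TRUE TRUE TRUE NA NA
print(idx_safe)  # TRUE TRUE TRUE FALSE FALSE

n_raw <- length(data$VAL[idx_raw])    # NA index entries are kept as NA elements
n_safe <- length(data$VAL[idx_safe])  # only the genuine matches remain
print(c(n_raw, n_safe))               # 5 3
```

The point is that an NA in a logical index does not drop the element; only FALSE does.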
If we take one step back, we see that we want to count the number of TRUEs in data$VAL=="24". We can also do that with, for example:
sum(data$VAL=="24",na.rm=TRUE)
which returns 3, since the na.rm argument specifies that we want to remove the NAs from the vector before summing. Hope this helps!
Upvotes: 0
Reputation:
The vector data$VAL == "24" has values that are TRUE, FALSE, or NA, depending on whether data$VAL is 24, something else that is not NA, or NA. When you subset a vector using a logical vector, positions where the index is NA are kept in the result, but their values become NA:
> a <- 1:5
> a[c(TRUE, FALSE, TRUE, FALSE, NA)]
[1] 1 3 NA
A shortcut for your case would have been sum(data$VAL==24, na.rm = TRUE), which sums the logical vector, converting it into 0s and 1s and removing the NAs.
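For comparison, a couple of other base R idioms should give the same count without an explicit is.na() check. This is only a sketch on a made-up vector val (not the course data), compared here as strings:

```r
# Hypothetical values, including an NA and a non-matching code
val <- c("24", "24", "24", NA, "10")

# sum() with na.rm = TRUE drops the NA comparisons before adding up the TRUEs
n_sum <- sum(val == "24", na.rm = TRUE)

# which() returns only the positions that are TRUE, skipping FALSE and NA
n_which <- length(which(val == "24"))

# %in% is based on matching and returns FALSE (never NA) for NA elements
n_in <- sum(val %in% "24")

print(c(n_sum, n_which, n_in))  # 3 3 3
```

Which one to use is mostly a matter of taste; sum(..., na.rm = TRUE) is probably the most direct way to express "count the matches".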
Upvotes: 2