Dominic

Reputation: 13

R programming: get number of variables in a column

I am currently taking the Getting and Cleaning Data Course on Coursera :D

The first quiz contained this question: How many properties are worth more than $1,000,000? Looking at the code book, it is clear that property values are listed in column VAL, and houses worth $1,000,000 or more are assigned the number 24.

My first attempt to solve this question looked like this:

length(data$VAL[data$VAL=="24"])

However, this didn't get me the right answer. By chance (and after some nervous breakdowns) I tried this (and it worked):

length(data$VAL[!is.na(data$VAL) & data$VAL=="24"])

Now I had the right solution, but I don't really understand why this works. In my first attempt above it seems all the NAs were included too, although I specified data$VAL=="24".

Can anybody please elaborate as to why my first guess didn't work but the second did? It seems counterintuitive to me. :/

Best wishes and thanks for your thoughts, Dominic

Upvotes: 1

Views: 101

Answers (2)

Florian

Reputation: 25405

Sample data:

data = data.frame(VAL=c('24','24','24',NA,NA))

Let's first look at

data$VAL=="24"

which returns

 [1] TRUE TRUE TRUE   NA   NA

So when you do

data$VAL[data$VAL=="24"]

you tell R to include from data$VAL all elements where data$VAL=="24" is TRUE, and to leave out those where it is FALSE (try, for example, c(1,2,3)[c(TRUE,FALSE,TRUE)]). For the fourth and fifth elements, the index is neither TRUE nor FALSE but NA, so R returns NA for each of those positions. So you get

[1] 24   24   24   <NA> <NA>

which has indeed a length of 5. When you do

data$VAL[!is.na(data$VAL) & data$VAL=="24"] 

you specify that you do not want to take the elements that are NA, and thus the length is 3.
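To make the difference concrete, here is a minimal sketch that reuses the sample data from this answer. The names mask, naive, guarded and via_which are only illustrative, and stringsAsFactors = FALSE is an assumption added so that VAL is a character column on any R version. The which() variant is one alternative (not the approach from the question) that drops NA positions as well.

```r
# Sample data as above; stringsAsFactors = FALSE keeps VAL as character.
data <- data.frame(VAL = c("24", "24", "24", NA, NA), stringsAsFactors = FALSE)

mask <- data$VAL == "24"                      # TRUE TRUE TRUE NA NA

naive     <- data$VAL[mask]                   # NA index positions yield NA elements
guarded   <- data$VAL[!is.na(data$VAL) & mask] # FALSE & NA is FALSE, so NAs are dropped
via_which <- data$VAL[which(mask)]            # which() keeps only the TRUE positions

length(naive)      # 5
length(guarded)    # 3
length(via_which)  # 3
```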

If we take one step back, we see that we want to count the number of TRUE's in data$VAL=="24". We can also do that with for example:

sum(data$VAL=="24",na.rm=TRUE)

which returns 3, since the na.rm argument specifies that we want to remove the NA's from the vector before summing. Hope this helps!
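As a rough illustration of why na.rm matters here, the sketch below uses a small made-up numeric vector (val is hypothetical, not the quiz data). Without na.rm = TRUE, a single NA in the comparison makes the whole sum NA.

```r
# Hypothetical values: two entries equal to 24, two other codes, two missing.
val <- c(24, 3, NA, 24, 10, NA)

hits <- val == 24             # TRUE FALSE NA TRUE FALSE NA

sum(hits)                     # NA, because the NAs propagate through sum()
sum(hits, na.rm = TRUE)       # 2, NAs are removed before TRUEs are counted as 1
```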

Upvotes: 0

user3603486


The vector data$VAL == "24" has values which are either TRUE, FALSE or NA, depending on whether data$VAL is 24, something else but not NA, or NA. When you subset a vector using a logical vector, NAs are included but the values become NA themselves:

> a <- 1:5
> a[c(TRUE, FALSE, TRUE, FALSE, NA)]
[1]  1  3 NA
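The same example can also show why the !is.na(...) & ... guard from the question works. This is a small sketch; idx is just an illustrative name. In R's three-valued logic, FALSE & NA is FALSE while TRUE & NA stays NA, so combining the index with !is.na() turns every NA position into FALSE.

```r
a   <- 1:5
idx <- c(TRUE, FALSE, TRUE, FALSE, NA)

a[idx]                  # 1 3 NA : the NA index produces an NA element

FALSE & NA              # FALSE
TRUE & NA               # NA

a[!is.na(idx) & idx]    # 1 3 : the NA position is now FALSE and is dropped
```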

A shortcut for your case would have been sum(data$VAL==24, na.rm = TRUE) which sums the logical vector, converting it into 0s and 1s, and removing NAs.

Upvotes: 2
