Subsetting in R returns different values using %in% vs. == operator

Question

I'm new to R and learning several ways to subset data. I'm puzzled by the difference in the number of matches when I subset a restaurant data set (restData) by zip code using the two expressions below.

> nrow(restData[restData$zipCode %in% c("21212","21213"),])
# [1] 59
> nrow(restData[restData$zipCode == c("21212","21213"),])
# [1] 26

Warning message:
In restData$zipCode == c("21212", "21213") :
  longer object length is not a multiple of shorter object length

I'm using the dataset below, in case you want to replicate this:

fileURL <- "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
download.file(fileURL, destfile = "./Rdata/restaurants.csv", method = "curl")
restData <- read.csv("./Rdata/restaurants.csv")

Ricardo Oliveros-Ramos · Accepted Answer

You are not getting the same answers because the two lines are not doing the same thing. The first one is the right one if you want to select the rows of restData where the zip code is either "21212" or "21213". See ?"%in%" for more details.
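As a minimal sketch on a made-up vector (zips is hypothetical, not the Baltimore data), %in% tests each element for membership in the set, which here gives the same result as combining two == comparisons with |:

```r
# Hypothetical toy data, not the Baltimore restaurant file
zips <- c("21212", "21213", "21201", "21213", "21212")

# Membership test: TRUE wherever the element equals any value in the set
in_set <- zips %in% c("21212", "21213")

# Equivalent (when there are no NAs): one scalar comparison per value, joined with |
either <- zips == "21212" | zips == "21213"

print(in_set)                     # TRUE TRUE FALSE TRUE TRUE
print(identical(in_set, either))  # TRUE
print(sum(in_set))                # 4
```

One difference to keep in mind: for a missing value, %in% returns FALSE while == returns NA, so the two forms only agree exactly when the column has no NAs.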

In the second line, it's important to know that R will "recycle" the elements of a shorter vector when needed for a binary operation. For example, 1:6 + 1:2 recycles the second vector (of length two) to length 6, so you are really computing 1:6 + rep(1:2, length=6). In your case, you are doing

restData$zipCode == rep(c("21212", "21213"), length=nrow(restData))

and the comparison is done element by element. So it tells you where the odd positions are "21212" or the even positions are "21213". The warning you're getting is important: it tells you that the length of the longer vector (the number of rows in restData) is not a multiple of the length of the shorter one (2), so the last pass of recycling is incomplete. In some cases recycling is useful; for example, restData$zipCode[c(TRUE, FALSE)] retrieves only the odd positions.
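To make the recycling concrete, here is a small sketch on a made-up vector of five zip codes (zips is hypothetical). With five elements on the left and two on the right, == pairs positions 1, 3 and 5 with "21212" and positions 2 and 4 with "21213", so matches in the "wrong" position are missed:

```r
# Hypothetical toy data: every element is one of the two target zip codes
zips <- c("21213", "21213", "21212", "21212", "21212")

# Recycled comparison. Without suppressWarnings(), R emits
# "longer object length is not a multiple of shorter object length"
# because 5 is not a multiple of 2.
by_position <- suppressWarnings(zips == c("21212", "21213"))

# The same comparison with the recycling written out explicitly
explicit <- zips == rep(c("21212", "21213"), length.out = length(zips))

print(by_position)                       # FALSE TRUE TRUE FALSE TRUE
print(identical(by_position, explicit))  # TRUE
print(sum(by_position))                  # 3 position-wise matches
print(sum(zips %in% c("21212", "21213")))  # 5 membership matches

# Recycling a logical index: keeps positions 1, 3, 5
odd_positions <- zips[c(TRUE, FALSE)]
print(odd_positions)                     # "21213" "21212" "21212"
```

This is the same effect as in the question: %in% counts every row whose zip code is in the set, while the recycled == only counts rows that happen to line up with the alternating pattern.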
