Reputation: 4469
I have a data frame with some 20 columns and some 10^7 rows. One of the columns is an id column that is a factor. I want to filter the rows by properties of the string representation of the factor's levels. The code below achieves this, but it seems rather inelegant to me. In particular, it seems I should not have to create a vector of the relevant ids.
Any suggestions for streamlining this?
library(dplyr)
library(tidyr)
library(gdata)
dat <- data.frame(id=factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))
europ.id <- function(id) {
ctry.code <- substring(id, nchar(id)-2)
ctry.code %in% c("nld", "ita")
}
ids <- levels(dat$id)
europ.ids <- subset(ids, europ.id(ids))
datx <- dat %>% filter(id %in% europ.ids) %>% drop.levels
Upvotes: 1
Views: 1510
Reputation: 4469
Docendo Discimus gave the right answer in the comments. To explain it, first look at the error I kept getting in my various attempts:
> dat %>% filter(europ.id(id))
Error in nchar(id) : 'nchar()' requires a character vector
Calls: %>% ... filter_impl -> .Call -> europ.id -> substring -> nchar
Then note that his solution works because grepl applies as.character to its argument if needed (from the manual page: "a character vector where matches are sought, or an object which can be coerced by as.character to a character vector"). This implicit application of as.character also happens if you use %in%. Since this solution also performs perfectly well, we can do the following:
dat %>% filter(europ.id(as.character(id))) %>% droplevels
Or, to make it read a bit nicer, update the function to
europ.id <- function(id) {
ids <- as.character(id)
ctry.code <- substring(ids, nchar(ids)-2)
ctry.code %in% c("nld", "ita")
}
and use
dat %>% filter(europ.id(id)) %>% droplevels
which reads exactly like what I was looking for.
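For reference, here is a minimal sketch of the grepl-based route mentioned above. The comment itself is not reproduced here, so the pattern "(nld|ita)$" is an assumption chosen to match the same three-letter suffixes that europ.id checks, and nchar.fails is just an illustrative name. It also shows that nchar rejects the factor while grepl coerces it on its own.

```r
library(dplyr)

dat <- data.frame(id = factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))

# nchar() refuses a factor; this is what triggered the error shown above
nchar.fails <- inherits(try(nchar(dat$id), silent = TRUE), "try-error")

# grepl() coerces the factor with as.character() itself.
# The pattern is an assumed equivalent of the suffix test in europ.id.
datx <- dat %>% filter(grepl("(nld|ita)$", id)) %>% droplevels()

levels(datx$id)
```

With the sample data this keeps only the "xxx-nld" and "zzz-ita" rows, and droplevels removes the two unused levels. Whether a regular expression or the substring-based europ.id reads better is a matter of taste; both avoid building the intermediate vector of ids.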
Upvotes: 3