kasterma

Reputation: 4469

dplyr::filter used with a function on string representation of factor

I have a data frame with about 20 columns and about 10^7 rows. One of the columns is an id column stored as a factor. I want to filter the rows by properties of the string representation of the factor's levels. The code below achieves this, but it seems rather inelegant to me. In particular, it seems I should not need to create a separate vector of the relevant ids.

Any suggestions for streamlining this?

library(dplyr)
library(tidyr)
library(gdata)

dat <- data.frame(id=factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita")))

europ.id <- function(id) {
  ctry.code <- substring(id, nchar(id)-2)
  ctry.code %in% c("nld", "ita")
}

ids <- levels(dat$id)
europ.ids <- subset(ids, europ.id(ids))

datx <- dat %>% filter(id %in% europ.ids) %>% drop.levels

Upvotes: 1

Views: 1510

Answers (1)

kasterma

Reputation: 4469

Docendo Discimus gave the right answer in the comments. To explain it, first look at the error I kept getting in my various attempts:

> dat %>% filter(europ.id(id))
Error in nchar(id) : 'nchar()' requires a character vector
Calls: %>% ... filter_impl -> .Call -> europ.id -> substring -> nchar
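
The error is not specific to dplyr. As a minimal sketch in base R (the names factor_result and char_result are only illustrative, not part of the original code), nchar rejects a factor outright, while the same call succeeds once the factor has been converted:

```r
# A factor stores integer codes plus a character vector of levels.
id <- factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita"))

# nchar() does not coerce a factor, so this call raises an error.
# tryCatch() captures the error message instead of stopping the script.
factor_result <- tryCatch(nchar(id), error = function(e) conditionMessage(e))

# After an explicit conversion the same call works element-wise.
char_result <- nchar(as.character(id))

print(factor_result)
print(char_result)
```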

Then note that his solution works because grepl applies as.character to its argument if needed (from the manual page: "a character vector where matches are sought, or an object which can be coerced by as.character to a character vector"). The same implicit conversion happens if you use %in%. nchar, by contrast, does not convert a factor, which is what triggers the error above. Since the conversion is also perfectly performant, we can do the following:

dat %>% filter(europ.id(as.character(id))) %>% droplevels

Or, to make it read a bit nicer, update the function to

europ.id <- function(id) {
  ids <- as.character(id)
  ctry.code <- substring(ids, nchar(ids)-2)
  ctry.code %in% c("nld", "ita")
}

and use

dat %>% filter(europ.id(id)) %>% droplevels

which reads exactly like what I was looking for.
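
For comparison, here is a hedged sketch of two other ways to build the same logical filter in base R. The regular expression "(nld|ita)$" is my own assumption about what a grepl-based version might look like (the comment's exact code is not reproduced here), and the names keep_grepl and keep_levels are only illustrative. The second variant evaluates europ.id once per level and then looks the result up through the factor's integer codes, which may help when there are far fewer levels than rows; I have not benchmarked it.

```r
id <- factor(c("xxx-nld", "xxx-jap", "yyy-aus", "zzz-ita"))

# The updated function from above: convert first, then test the last 3 characters.
europ.id <- function(id) {
  ids <- as.character(id)
  ctry.code <- substring(ids, nchar(ids) - 2)
  ctry.code %in% c("nld", "ita")
}

# Variant 1 (assumed pattern): grepl() coerces the factor itself.
keep_grepl <- grepl("(nld|ita)$", id)

# Variant 2: test each level once, then index by the factor's integer codes.
keep_levels <- europ.id(levels(id))[as.integer(id)]

# Both should agree with the row-wise call.
print(identical(keep_grepl, europ.id(id)))
print(identical(keep_levels, europ.id(id)))
```

Either vector can be passed to filter in place of europ.id(id), for example dat %>% filter(grepl("(nld|ita)$", id)) %>% droplevels.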

Upvotes: 3
