mcjudd
mcjudd

Reputation: 1580

Filtering rows by specific values %in% ONE OR MORE specified columns

I have a large number of data frame columns beginning with "country" and suffixed numerically by 1 (e.g., country1, country2, country3). I want to use dplyr to filter my dataframe df to rows where ONE OR MORE of the variables include certain country names as values (i.e., country1 OR country2 OR country3 %in% "Brazil"), along with a number of other conditions.

I've tried using grep to do this, but the problem I encounter is that the function is evaluated at the same time as the rest of the filter function, so I get an error message. Here's what I've tried:

df1 <- filter(df, 
                a %in% value1
              & b %in% value2
              & grep("^country.*", colnames(df), value = TRUE) %in% c("Brazil")

Of course, R just evaluates the grep function as ...

& c("country1", "country2", "country3", ... ) %in% c("Brazil")

And returns a logical vector, where obviously none of the values == "Brazil":

[1] FALSE FALSE FALSE ...

Is there a way to do this without laboriously typing all variables? For example...

& (country1 %in% "Brazil" | country2 %in% "Brazil" | country3 ...

Thanks!

Upvotes: 2

Views: 2254

Answers (5)

dholstius
dholstius

Reputation: 1007

More compact version:

df %>% filter(str_detect(country1:country3, "Brazil"))
#  country1 country2 country3
#         1        A        B   Brazil

(requires stringr as well as dplyr)

This may also work, per @beginneR's comment:

df %>% filter(str_detect(starts_with("country"), "Brazil"))

Upvotes: 6

talat
talat

Reputation: 70256

You could also try something like this. I start by creating an index of the rows where "Brazil" is present in any of the "country" columns and then use the index to subset/filter the data:

indx <- df %>% select(starts_with("country")) %>% 
             mutate_each(funs(grepl("Brazil", .))) %>% 
             rowSums > 0
indx
#[1]  TRUE FALSE

df %>% filter(indx)
#  country1 country2 country3
#1        A        B   Brazil

Sample data:

df <- data.frame(country1 = "A", country2 = "B", country3 = c("Brazil", "C"))
df
#  country1 country2 country3
#1        A        B   Brazil
#2        A        B        C

Edit

If you want to look for multiple country names, you only need a slight modification to paste the country names of interest together, e.g. like this:

First, I'll slightly modify the sample data to include a third row with "Argentina".

df <- data.frame(country1 = "A", country2 = "B", country3 = c("Brazil", "C", "Argentina"))
df
#  country1 country2  country3
#1        A        B    Brazil
#2        A        B         C
#3        A        B Argentina

Then you can store the country names you want to look for in a vector:

x  <- c("Brazil", "Argentina", "USA")

Now, build the indx again, but paste the country names of "x" together so that grepl will look for any of them (trick is to collapse with "|" - logical OR):

indx <- df %>% select(starts_with("country")) %>% 
  mutate_each(funs(grepl(paste(x, collapse = "|"), .))) %>% 
  rowSums > 0

Now you can subset your data either the standard way or for example using dplyr:

df[indx,]
#  country1 country2  country3
#1        A        B    Brazil
#3        A        B Argentina
df %>% filter(indx)
#  country1 country2  country3
#1        A        B    Brazil
#2        A        B Argentina

Upvotes: 3

David Arenburg
David Arenburg

Reputation: 92282

Here's a vectorized solution using base R

dat[rowSums(dat[grepl("^country\\d+$", names(dat))] == "Brazil") > 0, ]
##   Random1 Random2 country1 country2 country3
## 1       C       D        A        B   Brazil

Or data.tables binary search (which will be a certain winner for a big data set)

library(data.table)
setDT(dat)[, indx := rowSums(.SD[, grepl("^country\\d+$", names(dat)), with = F] == "Brazil") > 0]
setkey(dat, indx)[J(TRUE)]
##    Random1 Random2 country1 country2 country3 indx
## 1:       C       D        A        B   Brazil TRUE

Data set

dat <- data.frame(Random1 = c("C", "Brazil"), Random2 = "D", 
                  country1 = "A", country2 = "B", country3 = c("Brazil", "C"))
#   Random1 Random2 country1 country2 country3
# 1       C       D        A        B   Brazil
# 2  Brazil       D        A        B        C 

Upvotes: 4

Rich Scriven
Rich Scriven

Reputation: 99331

You can use a logical matrix as a helper and take the unique rows

dat[sort(unique(row(dat)[dat == "Brazil"])),]
#   country1 country2  country3 country4  country5
# 3   Brazil   Brazil Argentina   Brazil    Brazil
# 4   Mexico    Chile    Brazil    Chile Argentina
# 5   Brazil    Chile Argentina   Brazil Argentina
# 6   Mexico    Chile Argentina   Brazil    Brazil

Altrnatively, you can use which

w <- which(dat == "Brazil", arr.ind=TRUE)[,"row"]
dat[sort(unique(w)),]

Where the dat data is:

set.seed(25)
country <- c("Brazil", "Chile", "Argentina", "Mexico")
dat <- as.data.frame(matrix(sample(country, 30, TRUE), 6))
dat <- setNames(dat, paste0("country", 1:ncol(dat)))

Upvotes: 1

arvi1000
arvi1000

Reputation: 9582

Can you paste the columns together to get a super column and then grep it for Brazil?

Example:

set.seed(1)
dat <- data.frame(matrix(sample(c(rep('Canada',10), 'Brazil'), 20, replace=T),
                         ncol=5),
                  stringsAsFactors=F)

> dat
      X1     X2     X3     X4     X5
1 Canada Canada Canada Canada Canada
2 Canada Canada Canada Canada Brazil
3 Canada Brazil Canada Canada Canada
4 Canada Canada Canada Canada Canada

Then:

dat$X6 <- apply(dat, 1, paste, collapse="")
grepl('Brazil', dat$X6)

> grepl('Brazil', dat$X6)
[1] FALSE  TRUE  TRUE FALSE

As expected, returns TRUE only for rows with Brazil in them (i.e. 2 and 3)

Upvotes: 0

Related Questions