Reputation: 1580
I have a large number of data frame columns beginning with "country" and suffixed numerically by 1 (e.g., country1
, country2
, country3
). I want to use dplyr
to filter
my dataframe df
to rows where ONE OR MORE of the variables include certain country names as values (i.e., country1 OR country2 OR country3 %in% "Brazil
"), along with a number of other conditions.
I've tried using grep
to do this, but the problem I encounter is that the function is evaluated at the same time as the rest of the filter
function, so I get an error message. Here's what I've tried:
df1 <- filter(df,
a %in% value1
& b %in% value2
& grep("^country.*", colnames(df), value = TRUE) %in% c("Brazil")
Of course, R just evaluates the grep
function as ...
& c("country1", "country2", "country3", ... ) %in% c("Brazil")
And returns a logical vector, where obviously none of the values == "Brazil":
[1] FALSE FALSE FALSE ...
Is there a way to do this without laboriously typing all variables? For example...
& (country1 %in% "Brazil" | country2 %in% "Brazil" | country3
...
Thanks!
Upvotes: 2
Views: 2254
Reputation: 1007
More compact version:
df %>% filter(str_detect(country1:country3, "Brazil"))
# country1 country2 country3
# 1 A B Brazil
(requires stringr
as well as dplyr
)
This may also work, per @beginneR's comment:
df %>% filter(str_detect(starts_with("country"), "Brazil"))
Upvotes: 6
Reputation: 70256
You could also try something like this. I start by creating an index of the rows where "Brazil" is present in any of the "country" columns and then use the index to subset/filter the data:
indx <- df %>% select(starts_with("country")) %>%
mutate_each(funs(grepl("Brazil", .))) %>%
rowSums > 0
indx
#[1] TRUE FALSE
df %>% filter(indx)
# country1 country2 country3
#1 A B Brazil
df <- data.frame(country1 = "A", country2 = "B", country3 = c("Brazil", "C"))
df
# country1 country2 country3
#1 A B Brazil
#2 A B C
If you want to look for multiple country names, you only need a slight modification to paste the country names of interest together, e.g. like this:
First, I'll slightly modify the sample data to include a third row with "Argentina".
df <- data.frame(country1 = "A", country2 = "B", country3 = c("Brazil", "C", "Argentina"))
df
# country1 country2 country3
#1 A B Brazil
#2 A B C
#3 A B Argentina
Then you can store the country names you want to look for in a vector:
x <- c("Brazil", "Argentina", "USA")
Now, build the indx again, but paste the country names of "x" together so that grepl
will look for any of them (trick is to collapse with "|" - logical OR):
indx <- df %>% select(starts_with("country")) %>%
mutate_each(funs(grepl(paste(x, collapse = "|"), .))) %>%
rowSums > 0
Now you can subset your data either the standard way or for example using dplyr:
df[indx,]
# country1 country2 country3
#1 A B Brazil
#3 A B Argentina
df %>% filter(indx)
# country1 country2 country3
#1 A B Brazil
#2 A B Argentina
Upvotes: 3
Reputation: 92282
Here's a vectorized solution using base R
dat[rowSums(dat[grepl("^country\\d+$", names(dat))] == "Brazil") > 0, ]
## Random1 Random2 country1 country2 country3
## 1 C D A B Brazil
Or data.table
s binary search (which will be a certain winner for a big data set)
library(data.table)
setDT(dat)[, indx := rowSums(.SD[, grepl("^country\\d+$", names(dat)), with = F] == "Brazil") > 0]
setkey(dat, indx)[J(TRUE)]
## Random1 Random2 country1 country2 country3 indx
## 1: C D A B Brazil TRUE
Data set
dat <- data.frame(Random1 = c("C", "Brazil"), Random2 = "D",
country1 = "A", country2 = "B", country3 = c("Brazil", "C"))
# Random1 Random2 country1 country2 country3
# 1 C D A B Brazil
# 2 Brazil D A B C
Upvotes: 4
Reputation: 99331
You can use a logical matrix as a helper and take the unique rows
dat[sort(unique(row(dat)[dat == "Brazil"])),]
# country1 country2 country3 country4 country5
# 3 Brazil Brazil Argentina Brazil Brazil
# 4 Mexico Chile Brazil Chile Argentina
# 5 Brazil Chile Argentina Brazil Argentina
# 6 Mexico Chile Argentina Brazil Brazil
Altrnatively, you can use which
w <- which(dat == "Brazil", arr.ind=TRUE)[,"row"]
dat[sort(unique(w)),]
Where the dat
data is:
set.seed(25)
country <- c("Brazil", "Chile", "Argentina", "Mexico")
dat <- as.data.frame(matrix(sample(country, 30, TRUE), 6))
dat <- setNames(dat, paste0("country", 1:ncol(dat)))
Upvotes: 1
Reputation: 9582
Can you paste the columns together to get a super column and then grep it for Brazil?
Example:
set.seed(1)
dat <- data.frame(matrix(sample(c(rep('Canada',10), 'Brazil'), 20, replace=T),
ncol=5),
stringsAsFactors=F)
> dat
X1 X2 X3 X4 X5
1 Canada Canada Canada Canada Canada
2 Canada Canada Canada Canada Brazil
3 Canada Brazil Canada Canada Canada
4 Canada Canada Canada Canada Canada
Then:
dat$X6 <- apply(dat, 1, paste, collapse="")
grepl('Brazil', dat$X6)
> grepl('Brazil', dat$X6)
[1] FALSE TRUE TRUE FALSE
As expected, returns TRUE only for rows with Brazil in them (i.e. 2 and 3)
Upvotes: 0