Reputation: 87
I have a dataframe that has a sequence of variables, c1...c20
. Each of these variables contain a code. I have a vector of codes, code.vec
, and I would like to subset the dataframe to contain records where c1|c2|c3|...|c20
are in code.vec
.
Example data (only using 3 cn
variables for the example):
code.vec<-c("T1", "T2", "T3", "T4")
c1<-c("T1", "X1", "T6", "R5")
c2<-c("R4", "C6", "C7", "X3")
c3<-c("C5", "C2", "X4", "T2")
df<-data.frame(c1, c2, c3)
This is what I am currently doing:
library(dplyr)
df %>% filter(c1 %in% code.vec | c2 %in% code.vec | c3 %in% code.vec)
c1 c2 c3
1 T1 R4 C5
2 R5 X3 T2
This works, but since the real dataframe has 20 cn
variables, it becomes a lot of typing. It seems like there should be a simple apply or loop solution to this (and is easy to do in SAS using an array and a do loop) but I cannot work out a solution in R, and I can't find any similar questions on here.
Upvotes: 1
Views: 871
Reputation: 18681
Here is a simple solution using filter_all
from dplyr
:
library(dplyr)
df %>%
filter_all(any_vars(. %in% code.vec))
Result:
c1 c2 c3
1 T1 R4 C5
2 R5 X3 T2
Mentioned in the comments, if instead you want to filter on rows where all variables contain code.vec
, you can use replace any_vars
with all_vars
:
df %>%
filter_all(all_vars(. %in% code.vec))
Upvotes: 1
Reputation: 38510
Here is one method that should be relatively fast.
# get the position of the rows that match using modulus (final row returns 0)
temp <- which(unlist(dat) %in% code.vec) %% nrow(dat)
# replace 0s with final row
temp[temp == 0] <- nrow(dat)
Then subset
dat[unique(sort(temp)),]
c1 c2 c3
1 T1 R4 C5
4 R5 X3 T2
Note that I was working with character vectors in your data.frame variables. If you have them stored as factors you'll need to wrap unlist(dat)
in as.character()
.
data
dat <-
structure(list(c1 = c("T1", "X1", "T6", "R5"), c2 = c("R4", "C6",
"C7", "X3"), c3 = c("C5", "C2", "X4", "T2")), .Names = c("c1",
"c2", "c3"), row.names = c(NA, -4L), class = "data.frame")
Upvotes: 0