patward5656
patward5656

Reputation: 87

Filtering a dataframe based on a sequence of variables

I have a dataframe that has a sequence of variables, c1...c20. Each of these variables contain a code. I have a vector of codes, code.vec, and I would like to subset the dataframe to contain records where c1|c2|c3|...|c20 are in code.vec.

Example data (only using 3 cn variables for the example):

code.vec<-c("T1", "T2", "T3", "T4")

c1<-c("T1", "X1", "T6", "R5")
c2<-c("R4", "C6", "C7", "X3")
c3<-c("C5", "C2", "X4", "T2")

df<-data.frame(c1, c2, c3)

This is what I am currently doing:

library(dplyr)
df %>% filter(c1 %in% code.vec | c2 %in% code.vec | c3 %in% code.vec)

  c1 c2 c3
1 T1 R4 C5
2 R5 X3 T2

This works, but since the real dataframe has 20 cn variables, it becomes a lot of typing. It seems like there should be a simple apply or loop solution to this (and is easy to do in SAS using an array and a do loop) but I cannot work out a solution in R, and I can't find any similar questions on here.

Upvotes: 1

Views: 871

Answers (2)

acylam
acylam

Reputation: 18681

Here is a simple solution using filter_all from dplyr:

library(dplyr)

df %>% 
  filter_all(any_vars(. %in% code.vec))

Result:

  c1 c2 c3
1 T1 R4 C5
2 R5 X3 T2

Mentioned in the comments, if instead you want to filter on rows where all variables contain code.vec, you can use replace any_vars with all_vars:

df %>% 
  filter_all(all_vars(. %in% code.vec))

Upvotes: 1

lmo
lmo

Reputation: 38510

Here is one method that should be relatively fast.

# get the position of the rows that match using modulus (final row returns 0)
temp <- which(unlist(dat) %in% code.vec) %% nrow(dat)
# replace 0s with final row
temp[temp == 0] <- nrow(dat)

Then subset

dat[unique(sort(temp)),]
  c1 c2 c3
1 T1 R4 C5
4 R5 X3 T2

Note that I was working with character vectors in your data.frame variables. If you have them stored as factors you'll need to wrap unlist(dat) in as.character().

data

dat <-
structure(list(c1 = c("T1", "X1", "T6", "R5"), c2 = c("R4", "C6", 
"C7", "X3"), c3 = c("C5", "C2", "X4", "T2")), .Names = c("c1", 
"c2", "c3"), row.names = c(NA, -4L), class = "data.frame")

Upvotes: 0

Related Questions