Makaira Murakami
Makaira Murakami

Reputation: 61

Looping with filter (dplyr) through multiple columns with multiple values to match

I have the following data set, 'data':

       A1    A2    A3    A4    A5   
case1  1234  2234  223   4455  7334
case2  1254  2234  200   
case3  1234  9234  
case4  1274  5234  228   4655  2534
case5  1234  2634  273   7455  

Each row represents a case and each column represents a code attribute for that case, where each case can have multiple codes or no codes at all. No case can have multiple instances of the same code.

Also important is that each code can be either 3 or 4 digits and the 4 digit codes are children of the 3 digit codes. For instance, '123' would be a more broad code than '1234'; the '4' indicates a more specific type of '123' code.

My original function for pulling at cases that have the code '1234' was

dataf <- filter(data, A1:A5 == 1234)

, but the statement attempted to evaluate with an & statement rather than an OR statement, which is not what I'm looking for.

So currently I am filtering like so:

dataf <- filter(data, A1 == 1234 | A2 == 1234 | A3 == 1234 | A4 == 1234 | A5 == 1234)

which works fine for one specific code, but the amount of typing if I wanted to filter out more codes would be prohibitive.

Ultimately, what I would like to do is query all columns A1:A5 for any cases that have codes matching any values in the ranges

123:150

or

1230:1500

, as querying both of these ranges will return me cases that match any instances of both parent codes and child codes.

Upvotes: 1

Views: 3210

Answers (2)

Nick DiQuattro
Nick DiQuattro

Reputation: 739

I think this is a data format problem, so first let's get the data into the right format:'

df <- read.table(text = '       A1    A2    A3    A4    A5   
case1  1234  2234  223   4455  7334
case2  1254  2234  200   NA    NA
case3  1234  9234  NA    NA    NA
case4  1274  5234  228   4655  2534
case5  1234  2634  273   7455  NA')

library(dplyr)
library(tidyr)
newdf <- df %>%
          mutate(case = rownames(df)) %>%
          gather(Anum, code, -case)

> head(newdf)
#   case Anum code
#1 case1   A1 1234
#2 case2   A1 1254
#3 case3   A1 1234
#4 case4   A1 1274
#5 case5   A1 1234
#6 case1   A2 2234

Now you can run filter on just the one column:

filtdf <- filter(newdf, code == 1234)
#  case Anum code
#1 case1   A1 1234
#2 case3   A1 1234
#3 case5   A1 1234

To get your range of codes this should work:

filter(newdf, code > 122 & code < 151)

Upvotes: 1

talat
talat

Reputation: 70266

You could try something along the lines of:

x <- c(123:150, 1230:1500)
filter(df, rowSums(mutate_each(df, funs(. %in% x))) >= 1L)
#   case   A1   A2  A3   A4   A5
#1 case1 1234 2234 223 4455 7334
#2 case2 1254 2234 200   NA   NA
#3 case3 1234 9234  NA   NA   NA
#4 case4 1274 5234 228 4655 2534
#5 case5 1234 2634 273 7455   NA

If you do that for the example you provided (using only 1234 as value to be checked) it would result in what you show above:

filter(df, rowSums(mutate_each(df, funs(. %in% 1234))) >= 1L)
#   case   A1   A2  A3   A4   A5
#1 case1 1234 2234 223 4455 7334
#2 case3 1234 9234  NA   NA   NA
#3 case5 1234 2634 273 7455   NA

The inner function produces a data.frame of logical values:

mutate_each(df, funs(. %in% x))
#   case   A1    A2    A3    A4    A5
#1 FALSE TRUE FALSE FALSE FALSE FALSE
#2 FALSE TRUE FALSE FALSE FALSE FALSE
#3 FALSE TRUE FALSE FALSE FALSE FALSE
#4 FALSE TRUE FALSE FALSE FALSE FALSE
#5 FALSE TRUE FALSE FALSE FALSE FALSE

Which is (almost) the same as when using sapply

sapply(df, function(.) . %in% x)
#      case   A1    A2    A3    A4    A5
#[1,] FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE FALSE FALSE
#[3,] FALSE TRUE FALSE FALSE FALSE FALSE
#[4,] FALSE TRUE FALSE FALSE FALSE FALSE
#[5,] FALSE TRUE FALSE FALSE FALSE FALSE

And then you use rowSums to check for each row if at least one value is TRUE to subset the data accordingly.

Upvotes: 2

Related Questions