Adam Robinsson
Adam Robinsson

Reputation: 1751

Test for multiple strings in multiple columns

I've searched extensively on stack overflow without finding an answer to the following question: Im looking for a function that lets me test if some columns contain any of the specified strings.

# I'm looking for heart attacks
infarction <- c("b", "c")
# I'm also looking for strokes
stroke <- c("h", "i")

#sample data set
set.seed(1234)
dat <- data.frame(A = sample(letters[1:9],10,TRUE),
                  B = sample(letters[1:9],10,TRUE),
                  C = sample(letters[1:9],10,TRUE),
                  D = sample(letters[1:9],10,TRUE),
                  DATE = sample.int(10, size = 10, replace = FALSE))

# I've tried many things. Among them:
# first one using the dplyr package.

infarction = ifelse( (infarction  %in% dat[,c("A", "B", "C", "D")]), DATE, NA))

#excluded a few rows from the mutate...

#I've also tried
grep(paste(infarction,collapse="|"), dat[,1:4], value=TRUE), DATE, NA))

To sum up. I can get it to work if i only check in one column or only look after one of the strings. But I need to check if any of the strings is contained in any of the given columns, and in that case return the date value as a new variable.

Any help is greatly appreciated.

Upvotes: 1

Views: 1765

Answers (3)

rnso
rnso

Reputation: 24535

Following may help:

> apply(dat, 1, function(x) ifelse(infarction %in% x[1:5],x[5],NA  ))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] " 6" NA   " 3" NA   " 2" NA   " 7" NA   " 1" NA   
[2,] " 6" "10" " 3" NA   " 2" NA   " 7" " 9" NA   " 4" 

In data.frame form:

> as.data.frame(t(apply(dat, 1, function(x) ifelse(infarction %in% x[1:5],x[5],NA  ))))

     V1   V2
1     6    6
2  <NA>   10
3     3    3
4  <NA> <NA>
5     2    2
6  <NA> <NA>
7     7    7
8  <NA>    9
9     1 <NA>
10 <NA>    4

For stroke:

> as.data.frame(t(apply(dat, 1, function(x) ifelse(stroke %in% x[1:5],x[5],NA  ))))

     V1   V2
1  <NA> <NA>
2  <NA> <NA>
3  <NA> <NA>
4  <NA>    5
5     2 <NA>
6     8 <NA>
7  <NA> <NA>
8  <NA>    9
9     1    1
10    4 <NA>

Upvotes: 1

konvas
konvas

Reputation: 14346

It depends on what your desired output is. @Pop's answer is perfectly good I just want to add that your grep approach was on the right track except you can't give it a data frame in the x argument. You have to perform it column-wise. This is because grep will coerce the x argument to a character and you can see if you type as.character(dat[, 1:4]) that you are greping the wrong string. Instead, try for example

grepres <- lapply(dat[1:4], grep, pattern = paste(infarction,collapse="|"))

If you want to find which rows don't contain the pattern you can do for example

idx <- setdiff(1:nrow(dat), unlist(grepres))

and then

result <- rep(NA, nrow(dat))
result[idx] <- dat$DATE[idx]

Upvotes: 2

Pop
Pop

Reputation: 12401

If you just want to know if the columns contain the letters, you can apply to each column of your data.frame the %in% function:

apply(dat[,-5], 2, '%in%', x=stroke)

If you want to know the position of each letter of stroke in your columns, you can use the which function and apply it to each column:

apply(dat[,-5], 2, function(table,x) which(x==table), x=stroke[1])
apply(dat[,-5], 2, function(table,x) which(x==table), x=stroke[2])

or

sapply(stroke, function(y) apply(dat, 2, function(table,x) which(x==table), x=y), simplify=FALSE)

If you want to know the position of all letters of stroke, do this:

apply(dat[,-5], 2, function(x) unlist(sapply(stroke , function(i) which( i == x))))

Upvotes: 3

Related Questions