Max_IT

Reputation: 626

Getting lost using which() and regex in R

OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:

my_query <- c('g1', 'g2', 'g3')
my_data <- c('string2', 'string4', 'string5', 'string6')

I would like to return the indices of the elements of my_query that match any element of my_data. In the example above, only 'g2' occurs in my_data (inside 'string2'), so the result would be 2.

Upvotes: 2

Views: 177

Answers (2)

RHertel

Reputation: 23798

Expanding on a comment originally posted by @Gregor, you could try:

which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2 
# 2 

The function colSums is vectorized, so it poses no performance problem. The sapply() loop seems unavoidable here, since each element of the query vector has to be checked. The result of the loop is a logical matrix in which each column corresponds to an element of my_query and each row to an element of my_data. Wrapping this matrix in which(colSums(...) > 0) gives the indices of all columns that contain at least one TRUE, i.e., the elements of my_query that match at least one entry of my_data.
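To make the intermediate steps visible, here is a minimal sketch that uses the vectors from the question. The names m, hits and res are only illustrative and are not part of the one-liner above.

```r
my_query <- c('g1', 'g2', 'g3')
my_data  <- c('string2', 'string4', 'string5', 'string6')

# One column per pattern in my_query, one row per element of my_data
m <- sapply(my_query, grepl, my_data)
dim(m)  # 4 rows (my_data) by 3 columns (my_query)

# Number of matching my_data entries for each pattern
hits <- colSums(m)

# Keep the positions of the patterns with at least one match
res <- which(hits > 0)
```

Because sapply() names the columns after the patterns, res carries the matching pattern as its name, which is why the output above prints g2 over 2.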

Upvotes: 2

Zheyuan Li

Reputation: 73385

It seems to me that there is no easy way to do this without a loop. For each element of my_query, we can use either of the functions below to get TRUE or FALSE:

f1 <- function (pattern, x) length(grep(pattern, x)) > 0L

f2 <- function (pattern, x) any(grepl(pattern, x))

For example,

f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE

Then we use an *apply loop to apply, say, f2 to all elements of my_query:

which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
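The same loop can also be written with vapply(), which avoids the unlist() step and checks that every call returns exactly one logical value. This is only a sketch of an alternative; has_match is a hypothetical name for a helper with the same body as f2.

```r
my_query <- c('g1', 'g2', 'g3')
my_data  <- c('string2', 'string4', 'string5', 'string6')

# Hypothetical helper, same logic as f2: TRUE if the pattern matches anywhere in x
has_match <- function(pattern, x) any(grepl(pattern, x))

# logical(1) declares the expected return type and length of each call
matched <- vapply(my_query, has_match, logical(1), x = my_data, USE.NAMES = FALSE)

# Positions in my_query that matched at least one element of my_data
idx <- which(matched)
```

With USE.NAMES = FALSE the result is a plain index vector; leaving the default in place would label each index with its pattern instead.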

Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to define a separate function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?

Well, I am not entirely sure. When I read ?lengths:

 One advantage of ‘lengths(x)’ is its use as a more efficient
 version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
 ‘length’.

I don't know how much more efficient lengths is compared with sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) performs two loops. My edit essentially merges the two loops into one, hopefully for a small speed-up (if it is not negligible).
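For what it is worth, a quick sketch can confirm that the two variants return the same indices on the example data. The names two_pass, one_pass and same are only illustrative, and any speed difference would have to be measured on realistically sized input.

```r
my_query <- c('g1', 'g2', 'g3')
my_data  <- c('string2', 'string4', 'string5', 'string6')

# Original suggestion: lapply() collects the match positions, lengths() counts them
two_pass <- which(lengths(lapply(my_query, grep, my_data)) > 0L)

# Edited version: a single lapply() that reduces each pattern to TRUE/FALSE
one_pass <- which(unlist(lapply(my_query, function(pattern) any(grepl(pattern, my_data)))))

# Both approaches should agree on the matching positions
same <- identical(two_pass, one_pass)
```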

You can still write my edited version as a single line:

which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))

or

which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))

Upvotes: 5
