Reputation: 626
OK, I have a little problem which I believe I can solve with which and grepl (alternatives are welcome), but I am getting lost:
my_query <- c('g1', 'g2', 'g3')
my_data <- c('string2', 'string4', 'string5', 'string6')
I would like to return the indices of the elements of my_query that match something in my_data. In the example above, only 'g2' occurs in my_data (as part of 'string2'), so the result would be 2.
Upvotes: 2
Views: 177
Reputation: 23798
Expanding on a comment posted initially by @Gregor you could try:
which(colSums(sapply(my_query, grepl, my_data)) > 0)
#g2
# 2
The function colSums is vectorized, so it should not be a problem in terms of performance. The sapply() loop seems unavoidable here, since we need to check each element of the query vector. The result of the loop is a logical matrix, with each column representing an element of my_query and each row an element of my_data. Wrapping this matrix in which(colSums(..) > 0) gives the index numbers of all columns that contain at least one TRUE, i.e., a match with an entry of my_data.
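To make the intermediate steps described above visible, here is a minimal sketch that prints the logical matrix and the column sums before calling which(). The variable names m, hits and res are introduced only for illustration.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

# One column per pattern in my_query, one row per element of my_data
m <- sapply(my_query, grepl, my_data)
print(m)

# Number of my_data elements matched by each pattern
hits <- colSums(m)
print(hits)

# Positions in my_query that have at least one match
res <- which(hits > 0)
print(res)
```

Because sapply() keeps the query strings as column names, the result of which() is a named integer, which is why the output above shows g2 over the index 2.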
Upvotes: 2
Reputation: 73385
It seems to me that there is no easy way to do this without a loop. For each element in my_query, we can use either of the functions below to get TRUE or FALSE:
f1 <- function (pattern, x) length(grep(pattern, x)) > 0L
f2 <- function (pattern, x) any(grepl(pattern, x))
For example,
f1(my_query[1], my_data)
# [1] FALSE
f2(my_query[1], my_data)
# [1] FALSE
Then we use an *apply loop to apply, say, f2 to all elements of my_query:
which(unlist(lapply(my_query, f2, x = my_data)))
# [1] 2
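As a possible variation on the unlist(lapply(...)) call above (not part of the original suggestion), vapply() can be used so that each call to f2 is checked to return exactly one logical value. The names matched and idx are illustrative only.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

f2 <- function(pattern, x) any(grepl(pattern, x))

# vapply() errors if any call does not return a single logical value;
# USE.NAMES = FALSE keeps the result unnamed, like unlist(lapply(...))
matched <- vapply(my_query, f2, logical(1), x = my_data, USE.NAMES = FALSE)

# Positions in my_query with at least one match in my_data
idx <- which(matched)
print(idx)
```

Note that grepl() treats each element of my_query as a regular expression by default. If the queries should be matched literally, fixed = TRUE can be passed to grepl() inside f2.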
Thanks, that seems to work. To be honest, I preferred your original one-line version. I am not sure why you edited it to define a separate function that is then called with *apply. Is there any advantage compared to which(lengths(lapply(my_query, grep, my_data)) > 0L)?
Well, I am not entirely sure. When I read ?lengths:
One advantage of ‘lengths(x)’ is its use as a more efficient
version of ‘sapply(x, length)’ and similar ‘*apply’ calls to
‘length’.
I don't know how much more efficient lengths is than sapply. Anyway, if it is still a loop, then my original suggestion which(lengths(lapply(my_query, grep, my_data)) > 0L) performs two loops. My edit essentially combines the two loops into one, hopefully for some speed-up (if not too tiny to matter).
You can still arrange my new edit into a single line:
which(unlist(lapply(my_query, function (pattern, x) any(grepl(pattern, x)), x = my_data)))
or
which(unlist(lapply(my_query, function (pattern) any(grepl(pattern, my_data)))))
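Whether the single-loop version is actually faster is easy to check empirically. The sketch below wraps the two variants discussed above in helper functions (two_pass and one_pass are hypothetical names), confirms they agree, and times them on made-up larger inputs. The synthetic data and the repetition count are assumptions, and timings will vary by machine.

```r
my_query <- c("g1", "g2", "g3")
my_data  <- c("string2", "string4", "string5", "string6")

# Original suggestion: grep() for each pattern, then lengths() over the results
two_pass <- function(query, data) {
  which(lengths(lapply(query, grep, data)) > 0L)
}

# Edited suggestion: any(grepl()) for each pattern in a single loop
one_pass <- function(query, data) {
  which(unlist(lapply(query, function(pattern) any(grepl(pattern, data)))))
}

# Both variants give the same answer on the example from the question
print(two_pass(my_query, my_data))
print(one_pass(my_query, my_data))

# Made-up larger inputs, only for a rough timing comparison
set.seed(1)
big_query <- paste0("g", 1:200)
big_data  <- paste0("string", sample(1:1000, 500))

print(system.time(for (i in 1:20) two_pass(big_query, big_data)))
print(system.time(for (i in 1:20) one_pass(big_query, big_data)))
```

For more reliable numbers than system.time() gives, a dedicated benchmarking package could be used instead.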
Upvotes: 5