Find first matches for a vector of substrings in a vector of strings (duplicates in each)

I have two character vectors, x and y. The former contains (potential) substrings of the latter, and both contain duplicate values. For each element of x, I want to return the index of the first match (if present) in y, where the substring is matched at the beginning of the string (cf. the ^ anchor in regex), e.g.:

x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")

y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae", "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")

some function(first match for each element of x in y if there is a match)

3, 3, NA, 7, 3, 4

i.e., the function should return a vector of the same length as x, containing the indices of the first match in y, or NA for elements without a match. I've already tried base::startsWith(), but it only works for a single substring, and pmatch() hasn't worked for me either. I want to avoid apply and loops if possible, so vectorized solutions are preferred.

Upvotes: 0

Views: 467

Answers (2)

Karthik S

Reputation: 11548

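Using a traditional for loop:

v <- integer(0)
for (chr in x) {
  # startsWith() anchors the match at the beginning of each string in y;
  # [1] keeps the first hit, or NA when there is none
  v <- c(v, which(startsWith(y, chr))[1])
}
v
[1]  3  3 NA  7  3  4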
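
Since the question asks for an anchored match and prefers to avoid explicit loops, here is a loop-free sketch using outer() and startsWith(). It is only one possible approach, not necessarily the most efficient one: it builds a length(x) by length(y) logical matrix, so it assumes both vectors are small enough for that to fit in memory. The names hits and first_match are made up for illustration.

```r
x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")
y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae",
       "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")

# hits[i, j] is TRUE when y[j] starts with x[i] (literal match, no regex)
hits <- outer(x, y, function(prefix, string) startsWith(string, prefix))

# Column of the first TRUE in each row; "+ 0L" turns the logicals into 0/1
first_match <- max.col(hits + 0L, ties.method = "first")

# Rows without any TRUE would otherwise report column 1, so set them to NA
first_match[rowSums(hits) == 0] <- NA
first_match
#> [1]  3  3 NA  7  3  4
```

Because startsWith() compares literal text, this variant also behaves sensibly if an element of x happens to contain regex metacharacters such as "." or "(".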

Upvotes: 0

Till

Reputation: 6663

I can’t think of a solution without lapply() or purrr::map(). I'm not sure whether those are acceptable to you, but they are quite simple, so here we go:

x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")

y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae", "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")

Using lapply() and grep().

# "^" anchors each pattern at the start of the string, as the question requires
a <- lapply(x, function(z) grep(paste0("^", z), y)[1])
unlist(a)
#> [1]  3  3 NA  7  3  4

Using map_dbl(), we can make the code look a bit simpler, but it is essentially the same.

library(purrr)
map_dbl(x, ~ grep(paste0("^", .x), y)[1])
#> [1]  3  3 NA  7  3  4

Created on 2020-11-02 by the reprex package (v0.3.0)
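
Because x contains duplicates, the lookup only needs to run once per distinct prefix, with the results then mapped back onto the original vector via match(). The following is a minimal sketch of that idea, assuming the prefixes are meant as literal text rather than regular expressions. first_prefix_match is a hypothetical helper name, not a function from base R or purrr.

```r
x <- c("Halimid", "Halimid", "Callimid", "Diplid", "Halimid", "Cyathid")
y <- c("Bathymidae", "Bathymidae", "Halimidopidae", "Cyathidae",
       "Bothridae", "Cyathidae", "Diplididae", "Holothuridae")

# Hypothetical helper: index of the first element of `strings` that starts
# with each element of `prefixes`, or NA when nothing matches.
first_prefix_match <- function(prefixes, strings) {
  distinct <- unique(prefixes)
  # One lookup per distinct prefix; startsWith() matches literally and
  # only at the beginning of each string
  found <- vapply(
    distinct,
    function(p) which(startsWith(strings, p))[1],
    integer(1),
    USE.NAMES = FALSE
  )
  # Map the per-prefix results back onto the original (duplicated) input
  found[match(prefixes, distinct)]
}

first_prefix_match(x, y)
#> [1]  3  3 NA  7  3  4
```

This still iterates internally through vapply(), but only over the distinct values of x, which may help when x is long and highly repetitive.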

Upvotes: 1
