match substring from another list of all possible substrings

Question

I have a long vector of strings, each containing a market name and other stuff:

S = c('123_GOLD_534', '531_SILVER_dfds', '93_COPPER_29dad', '452_GOLD_deww')

and another vector containing all the possible markets:

V = c('GOLD','SILVER')

How can I extract the market name bit from S? Basically I want to loop over V and S, and replace S[j] with V[i] if grepl(V[i], S[j]) is TRUE.

So the result should look like

c('GOLD','SILVER',NA,'GOLD')

Wiktor Stribiżew · Accepted Answer

You may use str_extract from stringr:

> library(stringr)
> str_extract(S, paste(V, collapse="|"))
[1] "GOLD"   "SILVER" NA       "GOLD"

The paste(V, collapse="|") call builds the regex GOLD|SILVER, so str_extract returns the first occurrence of GOLD or SILVER in each string. If the regex does not match, it returns NA for that element.
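
If stringr is not available, roughly the same result can be obtained in base R. This is only a sketch: the helper name extract_first is made up for illustration, and it assumes the entries of V contain no regex metacharacters.

```r
S <- c('123_GOLD_534', '531_SILVER_dfds', '93_COPPER_29dad', '452_GOLD_deww')
V <- c('GOLD', 'SILVER')

# Hypothetical helper: return the first match of any entry of `choices`
# in each element of `x`, or NA where nothing matches.
extract_first <- function(x, choices) {
  pattern <- paste(choices, collapse = "|")   # e.g. "GOLD|SILVER"
  m <- regexpr(pattern, x)                    # -1 where there is no match
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)            # regmatches() skips non-matches
  out
}

result <- extract_first(S, V)
print(result)
# [1] "GOLD"   "SILVER" NA       "GOLD"
```

The NA pre-fill is needed because regmatches() drops the elements that did not match instead of returning NA for them.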

Note that if you need to match GOLD or SILVER only when enclosed in underscores, replace paste(V, collapse="|") with paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"):

> str_extract(S, paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"))
[1] "GOLD"   "SILVER" NA       "GOLD"

It will create a regex like (?<=_)(?:GOLD|SILVER)(?=_), which only matches GOLD or SILVER if there is a _ immediately before the value (the (?<=_) positive lookbehind) and a _ immediately after it (the (?=_) positive lookahead). Lookarounds do not add the text they check to the match (they are non-consuming), so the underscores are not part of the extracted value.
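
To see where the two patterns differ, here is a small sketch in base R (perl = TRUE is needed there for lookarounds). The input strings in tricky are invented for illustration and are not from the question.

```r
V <- c('GOLD', 'SILVER')
alt <- paste(V, collapse = "|")

plain   <- alt                                   # GOLD|SILVER
bounded <- paste0("(?<=_)(?:", alt, ")(?=_)")    # (?<=_)(?:GOLD|SILVER)(?=_)

# Hypothetical inputs: GOLD appears inside longer words in the first two.
tricky <- c('12_GOLDEN_x', 'MARIGOLD_7', '9_GOLD_z')

plain_hits   <- grepl(plain, tricky, perl = TRUE)
bounded_hits <- grepl(bounded, tricky, perl = TRUE)

print(plain_hits)
# [1] TRUE TRUE TRUE
print(bounded_hits)
# [1] FALSE FALSE  TRUE
```

The plain alternation matches GOLD inside GOLDEN and MARIGOLD, while the lookaround version only accepts the value when it is delimited by underscores on both sides.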
