jf328

Reputation: 7351

Match substring from another list of all possible substrings

I have a long vector of strings, each containing a market name plus other text:

S = c('123_GOLD_534', '531_SILVER_dfds', '93_COPPER_29dad', '452_GOLD_deww')

and another vector containing all the possible markets:

V = c('GOLD','SILVER')

How can I extract the market name bit from S? Basically I want to loop over V and S, replace S[j] with V[i] if grepl(V[i], S[j]).

So the result should look like

c('GOLD','SILVER',NA,'GOLD')

Upvotes: 2

Views: 398

Answers (1)

Wiktor Stribiżew

Reputation: 627468

You may use str_extract from stringr:

> library(stringr)
> str_extract(S, paste(V, collapse="|"))
[1] "GOLD"   "SILVER" NA       "GOLD"  

paste(V, collapse="|") builds a regex like GOLD|SILVER, so str_extract returns the first GOLD or SILVER found in each string. For any element where the regex does not match, it returns NA.
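If you would rather not depend on stringr, the same idea can be sketched in base R with regexpr and regmatches. This is only a minimal sketch using the S and V from the question; the variable names pattern, m and result are illustrative choices, not anything from the original code.

```r
S <- c('123_GOLD_534', '531_SILVER_dfds', '93_COPPER_29dad', '452_GOLD_deww')
V <- c('GOLD', 'SILVER')

pattern <- paste(V, collapse = "|")    # "GOLD|SILVER"
m <- regexpr(pattern, S)               # position of first match per element, -1 if none

# regmatches() returns only the elements that matched, so start from an
# all-NA vector and fill in the matched positions.
result <- rep(NA_character_, length(S))
result[m != -1] <- regmatches(S, m)
result
# [1] "GOLD"   "SILVER" NA       "GOLD"
```

The extra NA handling is needed because regmatches silently drops non-matching elements, whereas str_extract keeps the result aligned with the input.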

Note that if you need to match GOLD or SILVER only when enclosed by _ characters, replace paste(V, collapse="|") with paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"):

> str_extract(S, paste0("(?<=_)(?:", paste(V, collapse="|"), ")(?=_)"))
[1] "GOLD"   "SILVER" NA       "GOLD"  

It will create a regex like (?<=_)(?:GOLD|SILVER)(?=_), which only matches GOLD or SILVER if there is a _ immediately before the value (the (?<=_) positive lookbehind) and a _ immediately after it (the (?=_) positive lookahead). Lookarounds are non-consuming: the text they check is not added to the match, so the underscores are not part of the extracted value.
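To see where the two patterns actually differ, here is a small base R sketch. The input S2 and the helper extract_first are hypothetical, made up for illustration; note that base R needs perl = TRUE for lookarounds.

```r
# Hypothetical inputs where a market name is embedded in a longer word
S2 <- c('123_GOLD_534', 'MARIGOLD_1', 'x_SILVERWARE_y')
V <- c('GOLD', 'SILVER')

# Hypothetical helper: first match per element, NA where nothing matches
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)  # perl = TRUE enables lookbehind/lookahead
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  out
}

plain   <- paste(V, collapse = "|")              # GOLD|SILVER
bounded <- paste0("(?<=_)(?:", plain, ")(?=_)")  # (?<=_)(?:GOLD|SILVER)(?=_)

extract_first(S2, plain)
# [1] "GOLD"   "GOLD"   "SILVER"
extract_first(S2, bounded)
# [1] "GOLD" NA     NA
```

The plain alternation picks GOLD out of MARIGOLD and SILVER out of SILVERWARE, while the lookaround version rejects both because the value is not delimited by _ on each side.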

Upvotes: 4
