Reputation: 6874

How to extract only the capture group in a regex in R

I am trying to extract a pattern from some HTML. I am using stringr.

I have a character vector of IDs such as

nums <- c(">00324R<", ">E223143<", ">00000F<")

I have tried

str_extract(nums, ">[A-Z0-9]{4,}?<")

which extracts the pattern I want (">00324R<" ">E223143<" ">00000F<"), but I don't want the result to include < or >.

I am aware that positive lookahead may be the answer here but I don't seem to be able to create one that works and I'm not sure why not. I have tried:

str_extract(nums,"(?<=<)[A-Z0-9]{4,}?<")

Upvotes: 2

Answers (3)

user12864379

Reputation: 15

You could remove "<" or ">" as follows:

gsub("[<>]", "", nums)

Upvotes: 0

s_baldur

Reputation: 33488

If your strings are so consistent, you could just select anything that is not > or <:

str_extract(nums, "[^<>]+")
# [1] "00324R"  "E223143" "00000F"

Or gsub() them away:

gsub("[<>]", "", nums)
# [1] "00324R"  "E223143" "00000F"
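
Both one-liners lean on every element being exactly one ID wrapped in > and <. As a rough base-R sketch of what happens when that assumption breaks (the <td> input below is a made-up example, not something from the question):

```r
# Hypothetical input with surrounding tags; not taken from the question.
html <- "<td>00324R</td>"

# The first run of characters that are neither < nor > is the tag name.
first_run <- regmatches(html, regexpr("[^<>]+", html))

# Deleting the brackets leaves the tag names glued to the ID.
stripped <- gsub("[<>]", "", html)

print(first_run)  # [1] "td"
print(stripped)   # [1] "td00324R/td"
```

So these shortcuts are fine for the nums vector as posted, but a pattern that anchors on the surrounding > and < is probably safer on raw HTML.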

Upvotes: 2

Wiktor Stribiżew

Reputation: 626870

Use the following regex:

> str_extract(nums,"(?<=>)[A-Z0-9]{4,}(?=<)")
[1] "00324R"  "E223143" "00000F"

Details

(?<=>) - a positive lookbehind that matches a location immediately preceded by >
[A-Z0-9]{4,} - four or more uppercase ASCII letters or digits
(?=<) - a positive lookahead that matches a location immediately followed by <
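
Since the title asks about capture groups, here is a minimal base-R sketch of that route, assuming the same nums vector as in the question. regexec() and regmatches() are base R; the variable names are only for illustration.

```r
nums <- c(">00324R<", ">E223143<", ">00000F<")

# regexec() records the full match plus each capture group;
# regmatches() then returns one character vector per input element.
m <- regmatches(nums, regexec(">([A-Z0-9]{4,})<", nums))

# Element 1 is the full match, element 2 is the first capture group.
# Indexing a non-match (character(0)) with [2] gives NA.
ids <- vapply(m, function(x) x[2], character(1))
print(ids)  # [1] "00324R"  "E223143" "00000F"

# The pattern from the question looks behind for "<", but each ID is
# preceded by ">", so it matches nothing.
attempt_matches <- grepl("(?<=<)[A-Z0-9]{4,}?<", nums, perl = TRUE)
print(attempt_matches)  # [1] FALSE FALSE FALSE
```

With stringr, str_match(nums, ">([A-Z0-9]{4,})<")[, 2] should give the same vector, since str_match() returns the capture groups as matrix columns after the full match.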


Upvotes: 3
