Reputation: 6874
I am trying to extract a pattern from some HTML using stringr.
I have a character vector of codes such as nums <- c(">00324R<", ">E223143<", ">00000F<")
I have tried str_extract(nums, ">[A-Z0-9]{4,}?<")
which extracts the pattern I want (">00324R<" ">E223143<" ">00000F<"),
but I don't want the result to include the < or >.
I am aware that a lookaround (lookbehind/lookahead) may be the answer here, but I can't seem to write one that works and I'm not sure why. I have tried:
str_extract(nums,"(?<=<)[A-Z0-9]{4,}?<")
Upvotes: 2
Views: 245
Reputation: 15
You could remove "<" and ">" as follows:
gsub("[<>]", "", nums)
Upvotes: 0
Reputation: 33488
If your strings are this consistent, you could just select anything that is not > or <:
str_extract(nums, "[^<>]+")
# [1] "00324R" "E223143" "00000F"
Or gsub() them away (inside a character class | is a literal pipe, not alternation, so it is not needed):
gsub("[<>]", "", nums)
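Both shortcuts depend on the strings containing nothing but the code and its two delimiters. A minimal sketch of what happens otherwise, in base R so it runs without stringr; the second input, `messy`, is a hypothetical example and not from the question:

```r
# Input from the question, plus a hypothetical less-consistent string.
nums  <- c(">00324R<", ">E223143<", ">00000F<")
messy <- "<td>00324R</td>"  # assumed example, not from the original post

# First run of characters that are neither "<" nor ">".
first_run <- function(x) regmatches(x, regexpr("[^<>]+", x))

clean_run   <- first_run(nums)            # "00324R" "E223143" "00000F"
clean_strip <- gsub("[<>]", "", nums)     # same result on consistent input

messy_run   <- first_run(messy)           # "td": the tag name comes first
messy_strip <- gsub("[<>]", "", messy)    # "td00324R/td": tag text is kept

print(clean_run)
print(messy_run)
print(messy_strip)
```

On input like `messy`, a pattern that anchors on the delimiters is the safer choice.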
Upvotes: 2
Reputation: 626870
Use the following regex, which checks for > and < on either side without including them in the match:
> str_extract(nums,"(?<=>)[A-Z0-9]{4,}(?=<)")
[1] "00324R" "E223143" "00000F"
Details
(?<=>) - a positive lookbehind that matches a location immediately preceded by >
[A-Z0-9]{4,} - four or more uppercase ASCII letters or digits
(?=<) - a positive lookahead that matches a location immediately followed by <
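As a rough illustration of why the attempt in the question fails while the lookaround pattern works, here is a base R sketch. It uses `regexpr()` with `perl = TRUE` instead of stringr's `str_extract()` so it has no package dependency; that substitution is an assumption made for portability, not part of the answer above.

```r
nums <- c(">00324R<", ">E223143<", ">00000F<")

# Attempt from the question: the lookbehind requires "<" BEFORE the code,
# but each code is preceded by ">". The pattern also consumes the trailing "<".
attempt <- regexpr("(?<=<)[A-Z0-9]{4,}?<", nums, perl = TRUE)
attempt_found <- as.integer(attempt) != -1  # FALSE for every element

# Lookbehind for ">" and lookahead for "<": both are zero-width assertions,
# so neither delimiter ends up in the extracted text.
fixed <- regexpr("(?<=>)[A-Z0-9]{4,}(?=<)", nums, perl = TRUE)
codes <- regmatches(nums, fixed)

print(attempt_found)
print(codes)
# [1] "00324R"  "E223143" "00000F"
```

Note that `regmatches()` drops elements with no match, whereas `str_extract()` returns `NA` for them, so the two differ on vectors where some strings do not contain a code.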
Upvotes: 3