Reputation: 12935
I have a string containing HTML like:
s <- "...<span class=\"pull-right\">170 cm</span>...
<span class=\"pull-right\">29</span>...
<span class=\"pull-right\">06/24/1987</span>..."
in which ... means there are other HTML tags in between. I want to extract the information between > and </span>, which can be, among other things, a date in mm/dd/yyyy form.
I came up with something like this for the regex:
">[0-9/]*[a-z ]*[A-Z]*</span>"
Is this correct? How can I extract the values of interest? That is, given s, I want:
170 cm
29
06/24/1987
Upvotes: 0
Views: 72
Reputation: 9650
You'd better go for an HTML parser. But if you need a quick and dirty regex-based solution, use lookarounds to extract a pattern between an opening delimiter ((?<=>) for the preceding >) and a closing delimiter ((?=</span>) for the trailing </span>):
(?<=>)[0-9/A-Za-z ]*(?=</span>)
Please note that 0-9/, a-z, and A-Z are combined into one character class; otherwise strings like 1 Gb won't match (your original regex requires uppercase letters to follow lowercase ones).
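To make the difference concrete, here is a small check of both patterns against two sample fragments. The vector x and the variable names are made up for illustration; only the two patterns come from the thread.

```r
# Hypothetical sample fragments: one fits the asker's pattern, one does not.
x <- c(">170 cm</span>", ">1 Gb</span>")

# The asker's pattern: digits/slashes, then lowercase/spaces, then uppercase.
original <- ">[0-9/]*[a-z ]*[A-Z]*</span>"
# The same characters merged into a single class, so order no longer matters.
combined <- ">[0-9/A-Za-z ]*</span>"

original_hits <- grepl(original, x)
combined_hits <- grepl(combined, x)

print(original_hits)  # the "1 Gb" fragment fails: "b" comes after "G"
print(combined_hits)  # both fragments match
```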
Lookarounds require perl=TRUE, since R's default regex engine does not support them:
m <- gregexpr("(?<=>)[0-9A-Za-z /]*(?=</span>)", s, perl=TRUE)
regmatches(s, m)
Demo: https://ideone.com/yvXIuP
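For the HTML parser route recommended at the top of this answer, a minimal sketch could look like the following. It assumes the xml2 package is installed, and it uses a made-up, well-formed stand-in for s, since the real markup between the spans is elided in the question.

```r
# Sketch only: assumes the xml2 package is available (install.packages("xml2")).
library(xml2)

# Hypothetical stand-in for the asker's string; the <div> wrappers are invented.
s <- paste0(
  "<div><span class=\"pull-right\">170 cm</span></div>",
  "<div><span class=\"pull-right\">29</span></div>",
  "<div><span class=\"pull-right\">06/24/1987</span></div>"
)

doc <- read_html(s)                                        # parse the HTML string
spans <- xml_find_all(doc, "//span[@class='pull-right']")  # select by class via XPath
values <- xml_text(spans)                                  # text content of each span
print(values)
```

Unlike the regex, this does not depend on which characters appear inside the span, so values containing punctuation or other symbols would still be picked up.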
Upvotes: 1
Reputation: 195
Here is a regex that matches 170 cm, 29, and 06/24/1987:
(\d{2}\/\d{2}\/\d{4})|(\d+ [A-Za-z]+)|(\d+)
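As a rough sketch of how this pattern might be used from R: backslashes have to be doubled inside an R string literal, and the escape before / can be dropped because / is not special in a regex. The variable names below are illustrative, and s is the string from the question.

```r
s <- "...<span class=\"pull-right\">170 cm</span>...
<span class=\"pull-right\">29</span>...
<span class=\"pull-right\">06/24/1987</span>..."

# Alternation order matters: the date is tried first, then "number unit", then a bare number.
pattern <- "(\\d{2}/\\d{2}/\\d{4})|(\\d+ [A-Za-z]+)|(\\d+)"

m <- gregexpr(pattern, s, perl = TRUE)
values <- unlist(regmatches(s, m))  # flatten the one-element list into a character vector
print(values)
```

Note that this pattern is not anchored to the span tags, so it would also pick up any other numbers that appear elsewhere in the real HTML.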
Upvotes: 0