989

Reputation: 12935

Find all occurrences in string containing HTML

I have a string containing HTML like:

s <- "...<span class=\"pull-right\">170 cm</span>...
<span class=\"pull-right\">29</span>...
<span class=\"pull-right\">06/24/1987</span>..."

where ... stands for other HTML tags in between. I want to extract the text between > and </span>, which can take different forms, as in the sample above.

I came up with something like this for the regex:

">[0-9/]*[a-z ]*[A-Z]*</span>"

Is this correct? How can I extract the values of interest? That is, given s, I want to get:

170 cm
29
06/24/1987

Upvotes: 0

Views: 72

Answers (2)

Dmitry Egorov

Reputation: 9650

An HTML parser is the better tool for this. If you need a quick-and-dirty regex-based solution, though, use lookarounds to match the text between an opening pattern ((?<=>) for a preceding >) and a closing pattern ((?=</span>) for a trailing </span>):

(?<=>)[0-9/A-Za-z ]*(?=</span>)

Note that 0-9/, a-z, and A-Z are combined into a single character class. Otherwise strings like 1 Gb won't match, because your original regex requires digits and slashes first, then lowercase letters and spaces, then uppercase letters, in that fixed order.
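
To see the difference concretely, the following sketch checks both character-class layouts against the 1 Gb example. The test string x is a hypothetical span made up for illustration; it is not taken from the question.

```r
# Hypothetical test string wrapping the "1 Gb" example in a span
x <- "<span class=\"pull-right\">1 Gb</span>"

# Original pattern: digits/slashes, then lowercase/spaces, then uppercase, in that order
original <- ">[0-9/]*[a-z ]*[A-Z]*</span>"

# Combined pattern: all allowed characters in one class, in any order
combined <- ">[0-9/A-Za-z ]*</span>"

original_matches <- grepl(original, x)  # FALSE: the lowercase "b" comes after "G"
combined_matches <- grepl(combined, x)  # TRUE
```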

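In R, lookarounds are only available with perl=TRUE:

# perl = TRUE switches to PCRE, which supports lookbehind and lookahead
m <- gregexpr("(?<=>)[0-9A-Za-z /]*(?=</span>)", s, perl = TRUE)
# regmatches() returns a list with one character vector of matches per element of s
regmatches(s, m)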

Demo: https://ideone.com/yvXIuP
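
For the parser route recommended at the top of this answer, a minimal sketch might look like the following. It assumes the xml2 package is installed (it is not part of base R) and that every value of interest sits in a span whose class is exactly pull-right, as in the question's sample. The shortened s below, including the wrapping div, is a stand-in for the real page.

```r
library(xml2)  # assumption: xml2 is installed; it is not part of base R

# Shortened stand-in for the string from the question
s <- "<div><span class=\"pull-right\">170 cm</span>
<span class=\"pull-right\">29</span>
<span class=\"pull-right\">06/24/1987</span></div>"

# Parse the HTML fragment into a document tree
doc <- read_html(s)

# XPath: every <span> whose class attribute is exactly "pull-right"
nodes <- xml_find_all(doc, "//span[@class='pull-right']")

# Text content of each matched node, as a character vector
values <- xml_text(nodes)
values
```

Unlike the regex, this does not depend on which characters the values contain, only on the tag and class that wrap them.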

Upvotes: 1

Simo

Reputation: 195

Here is a regex that matches 170 cm, 29, and 06/24/1987:

(\d{2}\/\d{2}\/\d{4})|(\d+ [A-Za-z]+)|(\d+)
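
One way to apply this pattern in R is sketched below, assuming s is the string from the question (a shortened stand-in is used here). The backslashes are doubled because they sit inside an R string literal, and the \/ escape is dropped since / needs no escaping in R. Note that the pattern is not anchored to the span tags, so on a full page it would also pick up digits elsewhere in the markup.

```r
# Shortened stand-in for the string from the question
s <- "...<span class=\"pull-right\">170 cm</span>...
<span class=\"pull-right\">29</span>...
<span class=\"pull-right\">06/24/1987</span>..."

# Same alternation as above: date, or number plus unit, or bare number
pattern <- "(\\d{2}/\\d{2}/\\d{4})|(\\d+ [A-Za-z]+)|(\\d+)"

# Find every match in s, then pull the matched substrings out
m <- gregexpr(pattern, s, perl = TRUE)
values <- regmatches(s, m)[[1]]
values
```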

Upvotes: 0
