Find all occurrences in string containing HTML

Question

I have a string containing HTML like:

s <- "...170 cm...
29...
06/24/1987..."

in which ... means there are other HTML tags in between. I want to extract the information between > and the following closing tag, which can be:

only digits
digits and characters (uppercase or lowercase or both)
date of the form mm/dd/yyyy

I came up with something like this for the regex:

">[0-9/]*[a-z ]*[A-Z]*"

Is this correct? How can I extract the values of interest? That is, given s, the result should be:

170 cm
29
06/24/1987

Dmitry Egorov · Accepted Answer

You would be better off with an HTML parser. But if you need a quick-and-dirty regex-based solution, use lookarounds to extract the text between an opening pattern ((?<=>), a lookbehind for the preceding >) and a closing pattern ((?=), a lookahead for the trailing closing tag):

(?<=>)[0-9/A-Za-z ]*(?=)

Note that 0-9/, a-z, and A-Z are combined into one character class; otherwise strings like 1 Gb won't match in full (your original regex allows uppercase letters only after the lowercase ones).
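To make the difference concrete, here is a small sketch comparing the question's pattern with a single combined class on the string 1 Gb. The test string and the variable names are made up for illustration, and the lookarounds are left out so that only the character classes differ.

```r
# Hypothetical test string: '>' followed by a value that mixes digits,
# a space, an uppercase letter and a lowercase letter.
x <- ">1 Gb<"

# Pattern from the question: digits/slashes, then lowercase letters or
# spaces, then uppercase letters, strictly in that order.
old_match <- regmatches(x, regexpr(">[0-9/]*[a-z ]*[A-Z]*", x))

# Single combined class: the same characters in any order.
new_match <- regmatches(x, regexpr(">[0-9/A-Za-z ]*", x))

print(old_match)  # ">1 G"  (stops before the lowercase "b")
print(new_match)  # ">1 Gb"
```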

In R, lookarounds require perl=TRUE:

m <- gregexpr("(?<=>)[0-9A-Za-z /]*(?=)", s, perl=TRUE)
regmatches(s, m)
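A self-contained sketch of the same approach follows. It assumes, purely for illustration, that the values sit in <td> cells, so the trailing pattern is </td>; the sample string and that tag are hypothetical stand-ins for the real input. It also uses + rather than *, which is a small deviation that skips zero-length matches.

```r
# Hypothetical input: the <td> markup is an assumption, not the real string.
s <- "<tr><td>170 cm</td><td class=\"x\">29</td><td>06/24/1987</td></tr>"

# Lookbehind for the preceding '>' and lookahead for the assumed
# closing tag '</td>'; neither is included in the match itself.
m <- gregexpr("(?<=>)[0-9A-Za-z /]+(?=</td>)", s, perl = TRUE)

# regmatches() returns a list with one element per input string.
values <- regmatches(s, m)[[1]]
print(values)  # "170 cm" "29" "06/24/1987"
```

Because regmatches returns a list, [[1]] picks out the character vector for the single input string.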

Demo: https://ideone.com/yvXIuP
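
For the parser route recommended at the top of this answer, a minimal sketch could look like the following. It assumes the xml2 package is installed and, again only for illustration, that the values are table cells; adapt the XPath expression to the real markup.

```r
# Assumes the xml2 package is available: install.packages("xml2")
library(xml2)

# Hypothetical input: the table markup is an assumption for illustration.
html <- "<table><tr><td>170 cm</td><td>29</td><td>06/24/1987</td></tr></table>"

# Parse the HTML and select every <td> node with an XPath query.
doc <- read_html(html)
cells <- xml_text(xml_find_all(doc, "//td"))
print(cells)  # "170 cm" "29" "06/24/1987"
```

Unlike the regex, this does not depend on which characters the values contain.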
