Can someone explain how to design a regex that works regardless of whether there is a space after/before HTML tags?

Question

I need to use a regex instead of a parser to lift attributes from an HTML/XML page, but in Rubular I can't make the regex (?.*) lift 2346 21st Ave NE from the following text (spaced exactly like that).


2346 21st Ave NE

Also, the regex I have only works if I condense the text and there are spaces after the first HTML tag and before the last HTML tag. If I change the regex to eliminate those spaces, then HTML tags followed or preceded by spaces are skipped. I want to make the regex as flexible as possible.

How can I construct a regex that works regardless of whether there are spaces or line breaks after/before HTML tags?

Federico Piazza · Accepted Answer

As almost every answer about XHTML and regex will tell you, you should not use regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.
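For comparison, here is a minimal sketch of the parser route using Python's standard-library html.parser. The original markup did not survive in the question, so the span tag, its itemprop attribute, and the SpanTextExtractor class name are all assumptions made up for illustration.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside each outermost <span>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <span> elements are currently open
        self.texts = []     # stripped text of each completed outermost <span>
        self._buffer = []   # text pieces seen inside the current <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # strip() removes the spaces and line breaks around the text
                self.texts.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


# Assumed input: the tag and attribute are placeholders, not the asker's real markup.
html = '<span itemprop="streetAddress">\n    2346 21st Ave NE\n</span>'

parser = SpanTextExtractor()
parser.feed(html)
parser.close()
print(parser.texts)  # ['2346 21st Ave NE']
```

Because the parser hands over the text between tags as data, the whitespace question reduces to a single strip() call.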

You just have to use the s flag (single-line mode, in which . also matches line breaks) together with a lazy quantifier. Ruby, which Rubular uses, calls this mode m instead.

(?.*?)<\/span>
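A minimal sketch of the same idea in Python, where the s flag is spelled re.DOTALL and named groups are written (?P<name>...). The opening tag and the group name address are assumptions, since the original markup was lost; the \s* on each side of the group is an addition that keeps surrounding spaces and line breaks out of the capture.

```python
import re

# Assumed markup: the real tag and attribute did not survive in the question.
spaced = '<span itemprop="streetAddress">\n    2346 21st Ave NE\n</span>'
condensed = '<span itemprop="streetAddress">2346 21st Ave NE</span>'

# re.DOTALL: "." also matches "\n".
# .*? is lazy, so the match ends at the first </span>.
# \s* before and after the group absorbs optional spaces and line breaks.
pattern = re.compile(r'<span[^>]*>\s*(?P<address>.*?)\s*</span>', re.DOTALL)

for html in (spaced, condensed):
    match = pattern.search(html)
    print(match.group('address') if match else None)
# prints "2346 21st Ave NE" for both inputs
```

The same pattern handles both the spaced and the condensed input, which is what the question asks for.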


You can also use the inline s flag like this:

(?s)(?.*?)<\/span>
 ^--- here
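As a rough illustration in Python, the inline (?s) flag behaves the same as passing re.DOTALL separately. The b tags and the sample text are made up for the example.

```python
import re

text = '<b>\nfirst\n</b> and <b>second</b>'

# (?s) at the start of the pattern switches on dot-matches-newline.
inline = re.findall(r'(?s)<b>(.*?)</b>', text)

# Equivalent: pass the flag as an argument instead.
flagged = re.findall(r'<b>(.*?)</b>', text, flags=re.DOTALL)

# Without the flag, "." stops at "\n", so the first element is missed.
without = re.findall(r'<b>(.*?)</b>', text)

print(inline)   # ['\nfirst\n', 'second']
print(flagged)  # ['\nfirst\n', 'second']
print(without)  # ['second']
```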

On the other hand, if you don't want to use regex flags, you can use a well-known trick: a character class that combines two complementary sets, such as [\s\S]:

(?[\s\S]*?)<\/span>

Here is what the trick means:

\s     --> matches whitespace (spaces, tabs, line breaks)
\S     --> matches non whitespace (same as: [^\s])
[\s\S] --> matches whitespace or non whitespace (so... everything)
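A small Python sketch of the difference, with no flags set. The span tag is an assumed placeholder for the lost markup.

```python
import re

html = '<span>\n2346 21st Ave NE\n</span>'

# Plain "." does not match "\n", so the match cannot cross the line breaks.
dot_match = re.search(r'<span>(.*?)</span>', html)

# [\s\S] matches any character, line breaks included, with no flag needed.
any_match = re.search(r'<span>([\s\S]*?)</span>', html)

print(dot_match)                 # None
print(repr(any_match.group(1)))  # '\n2346 21st Ave NE\n'
```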

You can use this trick with any shorthand class paired with its negation, for example:

[\s\S] whitespace or non whitespace
[\w\W] word or non word
[\d\D] digit or non digit
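
Note that [\b\B] does not work the same way: inside a character class \b means a backspace character rather than a word boundary, and \B is either an error or a literal B, depending on the engine.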
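A quick check of these pairs in Python, as a sketch: the three shorthand pairs each match every character of a sample string, while Python's re module rejects [\b\B] at compile time. Other engines may treat \B inside a class differently, so the last result is specific to Python.

```python
import re

sample = "a1 \n_-"

# Each pair is a class plus its negation, so it matches every character.
matches_everything = {
    cls: ''.join(re.findall(cls, sample)) == sample
    for cls in (r'[\s\S]', r'[\w\W]', r'[\d\D]')
}
print(matches_everything)

# \b and \B are zero-width assertions, not character sets. Inside a class,
# Python reads \b as backspace and refuses \B as a bad escape.
try:
    re.compile(r'[\b\B]')
    boundary_class_compiles = True
except re.error:
    boundary_class_compiles = False
print(boundary_class_compiles)  # False
```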
