exlo

Reputation: 325

Can someone explain how to design a regex that works regardless of whether there is a space after/before HTML tags?

I need to use a regex instead of a parser to lift attributes from an HTML/XML page, but I can't get the regex <span class='street-address'> (?<Street>.*) to lift 2346 21st Ave NE from the following text (spaced exactly like that) in Rubular.

<span class='street-address'>
2346 21st Ave NE
</span>

Also, the regex I have only works if I condense the text onto one line and there are spaces after the first HTML tag and before the last HTML tag. If I change the regex to eliminate those spaces, then spaced HTML tags are skipped. I want to make the regex as flexible as possible.

How can I construct a regex that works regardless of whether there are spaces or line breaks after/before the HTML tags?

Upvotes: 0

Views: 47

Answers (1)

Federico Piazza

Reputation: 31025

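As you can find in almost all the answers related to XHTML and regex, you should not use regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.

That said, you just need to use the s flag (single line, so that . also matches line breaks) together with a lazy quantifier. In Ruby, which Rubular uses, the same behavior is enabled by the m option (inline: (?m)) rather than s.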

<span class='street-address'>(?<Street>.*?)<\/span>

Working demo
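As a rough illustration of the flag plus lazy quantifier, here is a minimal Python sketch. It rests on a few assumptions that are not part of the original answer: Python spells named groups (?P<Street>...) instead of (?<Street>...), the s flag is called re.DOTALL there, and the \s* on either side of the group is an optional addition that absorbs spaces or line breaks next to the tags.

```python
import re

# Sample text from the question, spaced exactly as posted.
html = """<span class='street-address'>
2346 21st Ave NE
</span>"""

# re.DOTALL is Python's name for the "s" flag: "." also matches newlines.
# The \s* on both sides (an addition, not in the original pattern) swallows
# any spaces or line breaks between the tags and the captured text.
pattern = re.compile(
    r"<span class='street-address'>\s*(?P<Street>.*?)\s*</span>",
    re.DOTALL,
)

match = pattern.search(html)
street = match.group("Street") if match else None
print(street)  # 2346 21st Ave NE
```

The same pattern should also match when the span is written on one line, with or without spaces around the address, since \s* is allowed to match nothing.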

You can also use the inline s flag like this:

(?s)<span class='street-address'>(?<Street>.*?)<\/span>
 ^--- here
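To see what the inline flag changes, the following Python sketch runs the pattern with and without (?s). The Python-style (?P<Street>...) group name syntax is an assumption of this example; the pattern is otherwise the one shown above.

```python
import re

html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

# (?s) at the very start of the pattern turns on "dot matches newline"
# for the whole expression, so no separate flags argument is needed.
inline = re.search(r"(?s)<span class='street-address'>(?P<Street>.*?)</span>", html)

# Without the flag, "." cannot cross the line break, so nothing matches.
no_flag = re.search(r"<span class='street-address'>(?P<Street>.*?)</span>", html)

inline_street = inline.group("Street")
print(repr(inline_street))  # '\n2346 21st Ave NE\n'
print(no_flag)  # None
```

Note that the captured text keeps the surrounding line breaks here, so you may want to trim it afterwards.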

On the other hand, if you don't want to use regex flags, you could use a well-known trick: combine a shorthand class with its negation in one set, such as [\s\S], like this:

<span class='street-address'>(?<Street>[\s\S]*?)<\/span>
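A short Python sketch of the flag-free variant follows. The second span and its address are made up for illustration, and the (?P<Street>...) syntax and the final strip() call are assumptions of this example rather than part of the answer. With two spans in the input, it also shows why the lazy *? matters: each match stops at the nearest closing tag.

```python
import re

# The second span is a hypothetical extra record added for illustration.
html = (
    "<span class='street-address'>\n2346 21st Ave NE\n</span>\n"
    "<span class='street-address'>10 Main St</span>"
)

# [\s\S] matches any character, newlines included, with no flags set.
# The lazy *? keeps each match from running on to the last </span>.
pattern = re.compile(r"<span class='street-address'>(?P<Street>[\s\S]*?)</span>")

# strip() removes the spaces and line breaks left around each capture.
streets = [m.group("Street").strip() for m in pattern.finditer(html)]
print(streets)  # ['2346 21st Ave NE', '10 Main St']
```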

For reference, this is how the trick works:

\s     --> matches whitespace (spaces, tabs, line breaks)
\S     --> matches non whitespace (same as: [^\s])
[\s\S] --> matches whitespace or non whitespace (so... any character, including line breaks)

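You can use this trick with any shorthand class and its negation, like:

[\s\S] whitespace or non whitespace
[\w\W] word or non word
[\d\D] digit or non digit

Note that [\b\B] does not work the same way: \b and \B are zero-width assertions, not character classes. Inside brackets, \b means a backspace character in most flavors, and \B is either an error or a literal B depending on the flavor.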
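As a quick sanity check, this Python sketch counts how many characters of a small sample string the \s, \w and \d pairs match. The sample string is arbitrary and chosen only to mix letters, digits, whitespace and punctuation.

```python
import re

# Arbitrary sample: letter, digit, space, newline, tab, underscore, hyphen.
sample = "a1 \n\t_-"

# Each set pairs a shorthand class with its negation, so every character
# of the sample, newline included, should be matched exactly once.
counts = {
    cls: len(re.findall(cls, sample))
    for cls in (r"[\s\S]", r"[\w\W]", r"[\d\D]")
}

print(len(sample))  # 7
print(counts)
```

All three counts should equal the length of the sample, which is what makes these sets interchangeable as a "match anything" token.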

Upvotes: 2
