Reputation: 325
I need to use a regex instead of a parser to lift attributes from an HTML/XML page, but in Rubular I can't make the regex <span class='street-address'> (?<Street>.*)
lift 2346 21st Ave NE
from the following text (spaced exactly like this):
<span class='street-address'>
2346 21st Ave NE
</span>
Also, the regex I have only works if I condense the text onto one line and there are spaces after the opening HTML tag and before the closing HTML tag. If I change the regex to eliminate those spaces, then tags with spaces around the content are skipped. I want to make the regex as flexible as possible.
How can I construct a regex that works regardless of whether there are spaces or line breaks after the opening tag and before the closing tag?
Upvotes: 0
Views: 47
Reputation: 31025
As you can find in almost all the answers related to XHTML and regex, you should not use regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.
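For comparison, here is a minimal sketch of the parser route, assuming Python and its standard-library html.parser module (neither is named in the question; the class name StreetAddressParser is made up for illustration). It collects the text of every span whose class attribute is exactly street-address and trims the surrounding whitespace.

```python
from html.parser import HTMLParser


class StreetAddressParser(HTMLParser):
    """Hypothetical helper: collects text inside <span class='street-address'>."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a matching span
        self.addresses = []   # one entry per matching span
        self._chunks = []     # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            self.depth += 1   # nested span inside the target span
        elif ("class", "street-address") in attrs:
            self.depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.addresses.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self.depth:
            self._chunks.append(data)


html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"
parser = StreetAddressParser()
parser.feed(html)
parser.close()
print(parser.addresses)
```

Because the parser reports tags and text separately, spacing and line breaks around the address need no special handling beyond the final strip().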
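If you do use a regex, you just need the s (single-line) flag, so that . also matches line breaks, plus a lazy quantifier. In Ruby, which Rubular uses, the equivalent flag is named m.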
<span class='street-address'>(?<Street>.*?)<\/span>
You can also set the s flag inline, like this:
(?s)<span class='street-address'>(?<Street>.*?)<\/span>
^--- here
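As a quick check of both forms of the flag, here is a small sketch in Python (an assumption; the question itself uses Rubular). Note that Python spells named groups (?P<Street>...) instead of (?<Street>...), and calls the s flag re.DOTALL. The second address, 10 Main St, is made-up sample data used only to show why the quantifier should be lazy.

```python
import re

html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

# Python syntax for the named group: (?P<Street>...)
pattern = r"<span class='street-address'>(?P<Street>.*?)</span>"

# Without the flag, "." stops at the line break, so nothing matches.
no_flag = re.search(pattern, html)

# Flag passed as an argument: "." now also matches "\n".
with_flag = re.search(pattern, html, re.DOTALL)

# The same flag written inline at the start of the pattern.
inline = re.search("(?s)" + pattern, html)

street = with_flag.group("Street").strip()
print(no_flag, repr(street))

# Lazy vs. greedy on two spans (second address is hypothetical data).
two_spans = html + "\n" + html.replace("2346 21st Ave NE", "10 Main St")
greedy = re.findall(r"(?s)<span class='street-address'>(.*)</span>", two_spans)
lazy = re.findall(r"(?s)<span class='street-address'>(.*?)</span>", two_spans)
print(len(greedy), len(lazy))
```

The captured group still contains the line breaks around the address, so the sketch trims it with strip() after matching.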
On the other hand, if you don't want to use regex flags, you can use a well-known trick: combine two complementary classes, such as [\s\S], like this:
<span class='street-address'>(?<Street>[\s\S]*?)<\/span>
For reference, this trick works as follows:
\s --> matches whitespace (spaces, tabs, line breaks).
\S --> matches non-whitespace (same as [^\s])
[\s\S] --> matches whitespace or non-whitespace (so... any character)
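To tie this back to the spacing problem in the question, here is a hedged sketch in Python (an assumption, with Python's (?P<Street>...) spelling for the named group). It uses [\s\S]*? without any flag and adds \s* on both sides of the group, which is my own addition rather than part of the pattern above, so the captured value comes out already trimmed whether or not there are spaces or line breaks next to the tags.

```python
import re

# No flags needed: [\s\S] matches line breaks on its own.
# The \s* on each side absorbs optional whitespace outside the capture.
pattern = re.compile(
    r"<span class='street-address'>\s*(?P<Street>[\s\S]*?)\s*</span>"
)

samples = [
    "<span class='street-address'>\n2346 21st Ave NE\n</span>",  # line breaks
    "<span class='street-address'>2346 21st Ave NE</span>",      # no spacing
    "<span class='street-address'> 2346 21st Ave NE </span>",    # spaces
]

streets = [pattern.search(s).group("Street") for s in samples]
print(streets)
```

The lazy quantifier matters here: it lets the trailing \s* claim the whitespace before the closing tag instead of the capture group.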
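You can use this trick with any pair of complementary classes, like:
[\s\S] whitespace or non-whitespace
[\w\W] word or non-word character
[\d\D] digit or non-digit
Note that [\b\B] is not one of them: \b and \B are zero-width assertions rather than character classes, and inside brackets \b means backspace in most engines.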
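A short way to check such pairs, sketched in Python (an assumption about the engine): each complementary pair should match every character of a test string, including the line break, while a bare . without flags skips the line break. The last line checks one Python-specific detail: inside brackets, \b is the backspace character.

```python
import re

text = "a\n1 _"  # letter, line break, digit, space, underscore

# Each complementary pair should match all five characters.
counts = {
    cls: len(re.findall(cls, text))
    for cls in (r"[\s\S]", r"[\w\W]", r"[\d\D]")
}

# A plain "." with no flags does not match the line break.
dot_count = len(re.findall(r".", text))

# In Python, [\b] matches the backspace character (0x08).
backspace_only = re.fullmatch(r"[\b]", "\x08") is not None

print(counts, dot_count, backspace_only)
```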
Upvotes: 2