Frank H.

Reputation: 896

How best to parse/split a <div> based on <br /> tags

I have a <div> containing a multiline address that I'd like to split into single lines so that I can identify the city, postcode, etc.

For example

<div>Ministry of Magic
    <br />Whitehall
    <br />London
    <br />SW1A 2AA
</div>

I can do it no problem with the split function, for example (assuming the address div is in the variable $text)

use feature 'say';
my @lines = split qr{<br\s?/>}, $text;
foreach my $line (@lines) {
    say $line;
}

displays

Ministry of Magic
Whitehall
London
SW1A 2AA

However, I'm well aware that using a regex to parse HTML is verboten so I thought I'd give it a try using HTML::TreeBuilder / HTML::Element but I'm not sure how to grab the content. I can do a look_down for the 'br' tags, but it only returns the <br /> tags themselves. This is not surprising because a <br> element cannot contain content, but I don't know what syntax to use instead.

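use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
# look_down returns only the matching <br> elements, not the text between them
my @content = $tree->parse($text)->guts()->look_down(_tag => 'br');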
foreach my $line (@content) {
    say $line->as_HTML;
}

displays

<br />
<br />
<br />

So, my questions are: 1) should I stick with the regex or use HTML::TreeBuilder, and 2) if I should use HTML::TreeBuilder, how can I extract the four lines of text I'm interested in?

Upvotes: 4

Views: 595

Answers (1)

type_outcast

Reputation: 635

If your case is (and will always be) as simple as you describe, then I'd stick with the regexes. Before you cry havoc and release the dogs on me, think for a second:

Yes, it's true that regexes cannot parse HTML. But we are not parsing HTML here. We are parsing a very, very tiny subset of HTML within a <div>, which is easily handled by a simple regex. Using a full-blown parsing library would, to me, be rather like using a sledgehammer to crack a walnut.

I would personally upgrade your regex a bit to m!<\s*br\s*/?\s*>! to catch (slightly) mangled HTML, and, as with anything, test with every valid and invalid input you can put together.
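
To make that concrete, here is a minimal sketch using only core Perl. It assumes the input is a single, un-nested <div> like the one in the question. The helper name split_address, the stripping of the <div> wrapper, the whitespace trimming, and the /i flag (so that <BR> also matches) are my own additions for illustration, not something your original code did.

```perl
use strict;
use warnings;
use feature 'say';

# Sample input copied from the question.
my $text = <<'HTML';
<div>Ministry of Magic
    <br />Whitehall
    <br />London
    <br />SW1A 2AA
</div>
HTML

# split_address is a hypothetical helper name, used here for illustration.
sub split_address {
    my ($html) = @_;

    # Drop the wrapping <div> ... </div> (assumes exactly one, not nested).
    $html =~ s{\A\s*<\s*div[^>]*>}{}i;
    $html =~ s{<\s*/\s*div\s*>\s*\z}{}i;

    # Split on <br>, <br/>, <br /> and slightly mangled variants.
    my @lines = split m!<\s*br\s*/?\s*>!i, $html;

    # Trim leading/trailing whitespace from each piece.
    s/\A\s+|\s+\z//g for @lines;

    # Discard empty pieces, e.g. from a trailing <br />.
    return grep { length } @lines;
}

say for split_address($text);
```

With the sample address this prints the four lines with the <div> tags and indentation removed. If the markup ever grows beyond this shape (nested tags, attributes containing ">", entities), that is the point where a real parser starts to earn its keep.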

Upvotes: 2
