Reputation: 1067
I'm trying to split a chunk of HTML code on the "table" tag and its contents.
So, I tried
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my @values = split(/<table*.*\/table>/, $html);
After this, I want the @values array to look like this: array('aaa', 'bbb', 'ccc').
But it returns this array: array('aaa', 'ccc').
Can anyone tell me how I can specify to the split function that each table should be parsed separately?
Thank you!
Upvotes: 1
Views: 2238
Reputation: 9697
Using an HTML parser may be a bit overkill for your example, but it will pay off later as your input grows more complex. A solution using HTML::TreeBuilder:
use HTML::TreeBuilder;
use Data::Dump qw(dd);
my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';
my $tree = HTML::TreeBuilder->new_from_content($html);
# remove all <table>....</table>
$_->delete for $tree->find('table');
dd($tree->guts); # ("aaa", "bbb", "ccc")
Upvotes: 2
Reputation: 40152
Your regex is greedy; change it to /<table.*?\/table>/ and it will do what you want. But you should really look into a proper HTML parser if you are going to be doing any serious work. A search of CPAN should find one that is suited to your needs.
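A minimal sketch of the difference, using the string from the question. The third pattern is my own addition and not part of the original answer: it assumes real tables may carry attributes or span several lines, which the one-line sample does not need.

```perl
use strict;
use warnings;

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';

# Greedy: .* runs from the first <table to the LAST /table>,
# so "bbb" is swallowed as part of one big separator.
my @greedy = split /<table.*\/table>/, $html;

# Non-greedy: .*? stops at the first /table> it can reach,
# so each table becomes its own separator.
my @lazy = split /<table.*?\/table>/, $html;

# Assumed variant: [^>]* allows attributes on the opening tag,
# and /s lets . match newlines inside the table.
my @robust = split /<table\b[^>]*>.*?<\/table>/s, $html;

print join(',', @greedy), "\n";   # aaa,ccc
print join(',', @lazy),   "\n";   # aaa,bbb,ccc
print join(',', @robust), "\n";   # aaa,bbb,ccc
```

Even the third pattern will break on nested tables, which is one reason a parser is the safer choice for anything beyond simple input.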
Upvotes: 4
Reputation: 5714
Use a ? after the quantifier to make the wildcard match non-greedy, i.e.
my @values = split(/<table*.*?\/table>/, $html);
Upvotes: 2
Reputation: 67900
Your regex .* is greedy, and therefore chews its way to the last part of the string. Change it to .*? and it should work better.
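To see how far each quantifier reaches, here is a small illustrative sketch that captures the whole match instead of splitting on it (the variable names are just for this example):

```perl
use strict;
use warnings;

my $html = 'aaa<table>test</table>bbb<table>test2</table>ccc';

# Capture the entire match to compare greedy and non-greedy behaviour.
my ($greedy) = $html =~ /(<table.*\/table>)/;
my ($lazy)   = $html =~ /(<table.*?\/table>)/;

print "greedy: $greedy\n";   # greedy: <table>test</table>bbb<table>test2</table>
print "lazy:   $lazy\n";     # lazy:   <table>test</table>
```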
Upvotes: 3