Reputation: 917
I have an HTML file containing data that I need to push to a MySQL database. I parse the file to pull the values I need into scalars, and that part works. The problem starts when the data I need is not on a single line of text but spread over several lines between two patterns. Here is what I have so far, which more or less works:
#!/usr/bin/perl
use strict;
use warnings;

binmode STDOUT, ':encoding(cp1250)';

my $file = 'index.html';
open my $fh, '<', $file or die "Could not open $file: $!";

my ($word, $description, $origin);

while (my $line = <$fh>) {
    # Headword: the text inside <h2 class="featured">...</h2>
    if ($line =~ m{<h2 class="featured">(.*)</h2>}) {
        $word = $1;
    }
    # Origin: the link text inside <h4 class="related-posts">...</h4>
    if ($line =~ m{<h4 class="related-posts">}) {
        print $line;
        if ($line =~ m{<a href="\.\./tag/lacina/index\.html" rel="tag">(.*)</a></h4>}) {
            $origin = $1;
        }
    }
}
close $fh;

print "$word\n";
print "$origin\n";
Now I want to grab a few lines of text. They do not have to end up in a single scalar, but I don't know how many lines there will be. All I know is that the lines sit between the opening and closing markers shown here:
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
<div class="box small arial">
I would also like to get rid of the <p> tags.
I thought of reading a line, storing it in a scalar, then reading the next line and comparing it to the saved scalar. But how am I supposed to check whether I have everything I want in that scalar?
Upvotes: 1
Views: 542
Reputation: 70732
Use the right tool for the job, an HTML parser, instead of a regular expression.
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
my $tr = HTML::TreeBuilder->new_from_file('index.html');
for my $div ($tr->look_down(_tag => 'div', 'class' => 'post-content')) {
    for my $t ($div->look_down(_tag => 'p')) {
        say $t->as_text;
    }
}
Output:
text I want
1.text I want
2.text I want
Upvotes: 2
Reputation: 35198
Use the range operator to find the text between two patterns:
use strict;
use warnings;
while (<DATA>) {
    if (my $range = /<div class="post-content">/ .. /<div class="box small arial">/) {
        next if $range =~ /E/;    # skip the closing marker line
        print;
    }
}
__DATA__
<html>
<head><title>stuff</title></head>
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
</html>
Outputs:
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
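If the marker lines and the <p> tags are not wanted either, the same flip-flop can be narrowed a little. This is only a sketch under assumptions: `extract_post_lines` is a hypothetical helper name, each marker is assumed to sit on its own line, and the tag stripping is a crude regex that is only reasonable for simple markup like the sample.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text lines between the two marker
# lines, excluding the markers themselves, with tags stripped.
sub extract_post_lines {
    my @lines = @_;
    my @wanted;
    for my $line (@lines) {
        # In scalar context ".." is the flip-flop: false outside the
        # range, 1 on the opening line, and a value ending in "E0" on
        # the closing line.
        my $range = ($line =~ /<div class="post-content">/)
                 .. ($line =~ /<div class="box small arial">/);
        next unless $range;          # outside the block
        next if $range == 1;         # opening marker line
        next if $range =~ /E0$/;     # closing marker line

        (my $text = $line) =~ s/<[^>]+>//g;   # crude tag removal
        $text =~ s/^\s+|\s+$//g;              # trim whitespace
        push @wanted, $text if length $text;  # drop lines that were only tags
    }
    return @wanted;
}

my $sample = <<'HTML';
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
HTML

print "$_\n" for extract_post_lines(split /\n/, $sample);
```

Returning a list rather than one joined string keeps each paragraph separate, which fits the case where the number of lines is not known in advance.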
However, the real answer is to use an actual HTML parser for parsing HTML.
I'd recommend Mojo::DOM. For a helpful 8-minute introductory video, check out Mojocast Episode 5.
use strict;
use warnings;
use Mojo::DOM;
my $data = do {local $/; <DATA>};
my $dom = Mojo::DOM->new($data);
for my $div ($dom->find('div[class=post-content]')->each) {
    print $div->all_text();
}
__DATA__
<html>
<head><title>stuff</title></head>
<body>
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
</body>
</html>
Outputs:
text I want 1.text I want 2.text I want
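If each paragraph is needed as a separate value rather than one run of text, the selector can target the <p> elements directly. A minimal sketch, assuming Mojolicious is installed; `post_paragraphs` is a hypothetical helper name, and the `div.post-content p` selector is a swapped-in alternative to the attribute selector used above:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: returns the text of every <p> found inside
# <div class="post-content">, one list element per paragraph.
sub post_paragraphs {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    # find() returns a Mojo::Collection; map('text') calls ->text on
    # each element, and each() flattens the collection into a list.
    return $dom->find('div.post-content p')->map('text')->each;
}

my $html = <<'HTML';
<div class="post-content">
<p>text I want</p>
<p>1.text I want</p>
<p>2.text I want</p>
</div>
<div class="box small arial">
</div>
HTML

print "$_\n" for post_paragraphs($html);
```

Because the parser handles the tags, there is nothing left to strip by hand and the number of paragraphs does not need to be known up front.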
Upvotes: 1