Reputation: 1142
How its being declared:
my $HTML_GRABBER = HTML::Parser->new('api_version' => 2,
'handlers' => {
'start' => [\&start_tag,"tagname,text"],
'text' => [\&read_text,"tagname, text"],
'end' => [\&end_tag,"tagname"]
}
);
callback function:
sub read_text {
print Dumper(@_);
die "\n";
my ($tag,$stuff) = @_;
if(($DO_NOTHING==0)&&($tag eq $current_tag))
{
push @{$data_queue}, $stuff;
}
}
result:
$VAR1 = undef;
$VAR2 = '
';
so it passes an undefined value and an empty string for tag and text, apparently. THis is reading from a saved HTML file on my harddrive. IDK
I had something like this in mind:
#DOC structure:
#(
# "title"=> {"text"=>@("text")}
# "div" => [
# {
# "p"=> [
# {
# "class" => string
# "id" => string
# "style" => string
# "data"=>["first line", "second line"]
# }
# ],
# "class" => string
# "id" => string
# "style" => string
# }
# ]
#)
Upvotes: 0
Views: 107
Reputation: 35208
You've told it to.
You specified which parameters should be passed to the text handler:
'text' => [\&read_text,"tagname, text"],
Well, there is no tagname
for a text token, and therefore it passes you undef
as the first paramter.
What exactly are you trying to do? If you describe your actual goal, we might be able to suggest a better solution instead of just pointing out the flaws in your current implementation. Check out: What is an XY Problem?
There are modern modules like Mojo::DOM
that are much better for navigating a document structure and finding specific data. Check out Mojocast Episode 5
for a helpful 8 minute introductory video.
You appear to be prematurely worried about efficiency of the parse. Initially, I'd advise you to just store the raw html in the database, and reparse it whenever you need to pull new information.
If you Benchmark
and decide this is too slow, then you can use Storable
to save a serialized copy of the parsed $dom object. However, this should definitely be in addition to the saved html.
use strict;
use warnings;
use Mojo::DOM;
use Storable qw(freeze thaw);
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
# Serializing to memory - Can then put it into a DB if you want
my $serialized = freeze $dom;
my $newdom = thaw($serialized);
# Load Title from Serialized dom
print $newdom->find('title')->text;
__DATA__
<html>
<head><title>My Title</title></head>
<body>
<h1>My Header one</h1>
<p>My Paragraph One</p>
<p>My Paragraph Two</p>
</body>
</html>
Outputs:
My Title
Upvotes: 1