jason dancks

Reputation: 1142

HTML::Parser handler sends undefined parameter to callback function?

How it's being declared:

my $HTML_GRABBER = HTML::Parser->new('api_version' => 2,
                    'handlers' => { 
                                'start' => [\&start_tag,"tagname,text"],
                                'text' => [\&read_text,"tagname, text"],
                                'end' => [\&end_tag,"tagname"]
                        }
                    );

callback function:

sub read_text {
    print Dumper(@_);
    die "\n";
    my ($tag,$stuff) = @_;
    if(($DO_NOTHING==0)&&($tag eq $current_tag))
    {
        push @{$data_queue}, $stuff;
    }
}

result:

$VAR1 = undef;
$VAR2 = '
';

So it apparently passes an undefined value for the tag and a string containing only a newline for the text. This is reading from a saved HTML file on my hard drive. I don't know why.

I had something like this in mind:

#DOC structure:
#(
#   "title"=> {"text"=>@("text")}
#   "div" => [
#           {
#               "p"=> [
#                       {
#                           "class" => string
#                           "id" => string
#                           "style" => string
#                           "data"=>["first line", "second line"]
#                       }
#                   ],
#               "class" => string
#               "id" => string
#               "style" => string
#           }
#       ]
#)

Upvotes: 0

Views: 107

Answers (1)

Miller

Reputation: 35208

You've told it to.

You specified which parameters should be passed to the text handler:

'text' => [\&read_text,"tagname, text"],

Well, there is no tagname for a text token, and therefore it passes you undef as the first parameter.
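If the goal is to know which element a piece of text belongs to, one option is to remember the open tags yourself in the start handler and look at that record from the text handler. Below is a minimal sketch of that idea, not necessarily what your final code should look like. The names @open_tags and %text_for and the sample HTML string are made up for illustration; api_version => 3, the start_h/end_h/text_h handler keys, the tagname and dtext argspecs, and the unbroken_text option come from the HTML::Parser documentation.

```perl
use strict;
use warnings;
use HTML::Parser;

my @open_tags;    # hypothetical: stack of element names that are currently open
my %text_for;     # hypothetical: tag name => text chunks found directly inside it

my $parser = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,    # report each run of text in one piece
    # Remember each tag as it opens
    start_h => [ sub { push @open_tags, $_[0] }, 'tagname' ],
    # Forget it again when the matching end tag arrives
    end_h => [ sub {
        my ($tag) = @_;
        pop @open_tags if @open_tags && $open_tags[-1] eq $tag;
    }, 'tagname' ],
    # Text events carry no tagname, so use the top of the stack instead
    text_h => [ sub {
        my ($text) = @_;
        return unless @open_tags && $text =~ /\S/;    # skip whitespace-only text
        push @{ $text_for{ $open_tags[-1] } }, $text;
    }, 'dtext' ],
);

# Sample input, invented for this sketch
$parser->parse('<html><head><title>My Title</title></head>'
             . '<body><p>One</p><p>Two</p></body></html>');
$parser->eof;

for my $tag (sort keys %text_for) {
    print "$tag: ", join(' | ', @{ $text_for{$tag} }), "\n";
}
```

For the sample string this should collect "One" and "Two" under p and "My Title" under title. Note that the stack is deliberately naive: elements without an end tag, such as br or img, are never popped, so real-world HTML would need extra handling.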

What exactly are you trying to do? If you describe your actual goal, we might be able to suggest a better solution instead of just pointing out the flaws in your current implementation. Check out: What is an XY Problem?

Addendum about Mojo::DOM

There are modern modules like Mojo::DOM that are much better for navigating a document structure and finding specific data. Check out Mojocast Episode 5 for a helpful 8-minute introductory video.
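To give a rough idea of what that navigation looks like, here is a small sketch that builds a nested structure loosely modelled on the one in your question. The sample HTML and the %doc layout are assumptions made for illustration; the Mojo::DOM calls used (new, at, find, attr, text) and the Mojo::Collection calls (map, to_array) are part of those modules.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample document, invented for this sketch
my $html = <<'HTML';
<html>
<head><title>My Title</title></head>
<body>
<div class="main" id="content">
<p class="intro">first line</p>
<p>second line</p>
</div>
</body>
</html>
HTML

my $dom = Mojo::DOM->new($html);

# Hypothetical layout: one entry per div, each holding its attributes
# and a list of its paragraphs
my %doc = (
    title => $dom->at('title')->text,
    div   => $dom->find('div')->map(sub {
        my $div = $_;
        return {
            class => $div->attr('class'),
            id    => $div->attr('id'),
            p     => $div->find('p')->map(sub {
                return { class => $_->attr('class'), data => $_->text };
            })->to_array,
        };
    })->to_array,
);

print "$doc{title}\n";
print "$_->{data}\n" for @{ $doc{div}[0]{p} };
```

Here at returns the first element matching a CSS selector, while find returns a collection of all matches, so the selectors do the work that the tag-tracking callbacks would otherwise have to do by hand.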

You appear to be prematurely worried about the efficiency of the parse. Initially, I'd advise you to just store the raw HTML in the database, and reparse it whenever you need to pull new information.

If you Benchmark and decide this is too slow, then you can use Storable to save a serialized copy of the parsed $dom object. However, this should definitely be in addition to the saved HTML.

use strict;
use warnings;

use Mojo::DOM;
use Storable qw(freeze thaw);

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

# Serializing to memory - Can then put it into a DB if you want
my $serialized = freeze $dom;
my $newdom = thaw($serialized);

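# Load the title from the thawed dom (at() returns the first matching element;
# find() returns a collection, which has no text method in current Mojolicious)
print $newdom->at('title')->text;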

__DATA__
<html>
<head><title>My Title</title></head>
<body>
<h1>My Header one</h1>
<p>My Paragraph One</p>
<p>My Paragraph Two</p>
</body>
</html>

Outputs:

My Title
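If you do want numbers before deciding, the core Benchmark module can compare the two approaches directly. This is only a sketch under assumptions: the generated HTML string is a stand-in for your saved file, the labels reparse and thaw are arbitrary, and the results will depend heavily on document size and on your Mojolicious and Storable versions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Mojo::DOM;
use Storable qw(freeze thaw);

# Stand-in document, invented for this sketch: one title plus 200 paragraphs
my $html = '<html><head><title>My Title</title></head><body>'
         . ('<p>My Paragraph</p>' x 200)
         . '</body></html>';

# Serialize the parsed dom once, as you would before storing it
my $serialized = freeze(Mojo::DOM->new($html));

# Run each variant for at least one CPU second and print a comparison table
cmpthese(-1, {
    reparse => sub { Mojo::DOM->new($html)->at('title')->text },
    thaw    => sub { thaw($serialized)->at('title')->text },
});
```

Only if thaw comes out clearly ahead on documents of your real size is it worth carrying the serialized copy around next to the raw HTML.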

Upvotes: 1
