HTML::Parser handler sends undefined parameter to callback function?

Question

How it's being declared:

my $HTML_GRABBER = HTML::Parser->new('api_version' => 2,
                    'handlers' => { 
                                'start' => [\&start_tag,"tagname,text"],
                                'text' => [\&read_text,"tagname, text"],
                                'end' => [\&end_tag,"tagname"]
                        }
                    );

callback function:

sub read_text {
    print Dumper(@_);
    die "
";
    my ($tag,$stuff) = @_;
    if(($DO_NOTHING==0)&&($tag eq $current_tag))
    {
        push @{$data_queue}, $stuff;
    }
}

result:

$VAR1 = undef;
$VAR2 = '
';

So it apparently passes an undefined value for the tag and a string containing only a newline for the text. This is reading from a saved HTML file on my hard drive, and I don't know why.

I had something like this in mind:

#DOC structure:
#(
#   "title"=> {"text"=>@("text")}
#   "div" => [
#           {
#               "p"=> [
#                       {
#                           "class" => string
#                           "id" => string
#                           "style" => string
#                           "data"=>["first line", "second line"]
#                       }
#                   ],
#               "class" => string
#               "id" => string
#               "style" => string
#           }
#       ]
#)

Miller · Accepted Answer

You've told it to.

You specified which parameters should be passed to the text handler:

'text' => [\&read_text,"tagname, text"],

Well, there is no tagname for a text token, and therefore it passes you undef as the first parameter.
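If what you want is to know which element a piece of text sits in, one option is to remember the open tags yourself in the start and end handlers and ask only for the text in the text handler. The following is a minimal sketch, not a drop-in fix: the names @open_tags and %text_for and the sample HTML string are made up for illustration, and it assumes well-nested markup with no void elements such as br.

```perl
use strict;
use warnings;
use HTML::Parser;

my @open_tags;   # stack of element names that are currently open (hypothetical helper)
my %text_for;    # element name => list of text chunks seen directly inside it

my $parser = HTML::Parser->new(
    api_version => 3,
    # Start tag: remember the element name.
    start_h => [ sub { push @open_tags, $_[0] }, 'tagname' ],
    # End tag: forget the element, but only if it is the innermost open one.
    end_h   => [ sub {
        pop @open_tags if @open_tags && $open_tags[-1] eq $_[0];
    }, 'tagname' ],
    # Text: no tagname is requested here; the stack supplies the context.
    text_h  => [ sub {
        my ($text) = @_;
        return unless @open_tags && $text =~ /\S/;   # skip whitespace-only chunks
        push @{ $text_for{ $open_tags[-1] } }, $text;
    }, 'dtext' ],
);

# Assumed sample input, for illustration only.
$parser->parse('<html><head><title>My Title</title></head>'
             . '<body><p>One</p><p>Two</p></body></html>');
$parser->eof;

print "$_: @{ $text_for{$_} }\n" for sort keys %text_for;
```

The point of the design is that the text handler's argspec only asks for things a text event actually has, so nothing comes through as undef.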

What exactly are you trying to do? If you describe your actual goal, we might be able to suggest a better solution instead of just pointing out the flaws in your current implementation. Check out: What is an XY Problem?

Addendum about Mojo::DOM

There are modern modules like Mojo::DOM that are much better for navigating a document structure and finding specific data. Check out Mojocast Episode 5 for a helpful 8-minute introductory video.

You appear to be prematurely worried about the efficiency of the parse. Initially, I'd advise you to just store the raw HTML in the database, and reparse it whenever you need to pull new information.

If you Benchmark and decide this is too slow, then you can use Storable to save a serialized copy of the parsed $dom object. However, this should definitely be in addition to the saved HTML.

use strict;
use warnings;

use Mojo::DOM;
use Storable qw(freeze thaw);

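# Slurp the whole HTML document from the __DATA__ section
my $dom = Mojo::DOM->new(do {local $/; <DATA>});

# Serializing to memory - Can then put it into a DB if you want
my $serialized = freeze $dom;
my $newdom = thaw($serialized);

# Load Title from Serialized dom
# (at() returns the first matching element; find() returns a collection)
print $newdom->at('title')->text;

__DATA__
<html>
<head>
<title>My Title</title>
</head>
<body>
<h1>My Header one</h1>
<p>My Paragraph One</p>
<p>My Paragraph Two</p>
</body>
</html>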

Outputs:

My Title
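
To check whether reparsing really is too slow before adding the Storable copy, a comparison along these lines could be run. This is only a rough sketch under assumptions: the document in $html is synthetic, the labels reparse and thaw are arbitrary, and real numbers will depend on your own pages.

```perl
use strict;
use warnings;

use Benchmark qw(cmpthese);
use Mojo::DOM;
use Storable qw(freeze thaw);

# Synthetic document, assumed for illustration: one title plus many paragraphs.
my $html = '<html><head><title>My Title</title></head><body>'
         . ('<p>Some paragraph</p>' x 500)
         . '</body></html>';

# Serialized copy of the parsed DOM, as it might be stored next to the raw HTML.
my $frozen = freeze(Mojo::DOM->new($html));

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    reparse => sub { Mojo::DOM->new($html)->at('title')->text },
    thaw    => sub { thaw($frozen)->at('title')->text },
});
```

Whichever variant wins, keeping the raw HTML remains the safe choice, since a serialized object may not thaw cleanly after a module upgrade.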
