Reputation: 16724
I have a html as the following:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body bgcolor="white">
<h1>foo.c</h1>
<form method="post" action=""
enctype="application/x-www-form-urlencoded">
Compare this file to the similar file:
<select name="file2">
<option value="...">...</option>
</select>
<input type="hidden" name="file1" value="foo.c" /><br>
Show the results in this format:
</form>
<hr>
<p>
<pre>
some code
</pre>
I need to get value of input name = 'file' and the contents of HTML pre tag. I don't know on perl language, by googling I wrote this small program(that I believe isn't "elegant"):
#!/usr/bin/perl
package MyParser;
use base qw(HTML::Parser);
#Store the file name and contents obtaind from HTML Tags
my($filename, $file_contents);
#This value is set at start() calls
#and use in text() routine..
my($g_tagname, $g_attr);
#Process tag itself and its attributes
sub start {
my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
$g_tagname = $tagname;
$g_attr = $attr;
}
#Process HTML tag body
sub text {
my ($self, $text) = @_;
#Gets the filename
if($g_tagname eq "input" and $g_attr->{'name'} eq "file1") {
$filename = $attr->{'value'};
}
#Gets the filecontents
if($g_tagname eq "pre") {
$file_contents = $text;
}
}
package main;
#read $filename file contents and returns
#note: it works only for text/plain files.
sub read_file {
my($filename) = @_;
open FILE, $filename or die $!;
my ($buf, $data, $n);
while((read FILE, $data, 256) != 0) {
$buf .= $data;
}
return ($buf);
}
my $curr_filename = $ARGV[0];
my $curr_file_contents = read_file($curr_filename);
my $parser = MyParser->new;
$parser->parse($curr_file_contents);
print "filename: ",$filename,"file contents: ",$file_contents;
Then I call ./foo.pl html.html
But I'm getting empty values from $filename
and $file_contents
variables.
How to fix this?
Upvotes: 2
Views: 2849
Reputation: 2668
Like always, there's more than one way to do it. Here's how to use the DOM Parser of Mojolicious for this task:
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
# slurp all lines at once into the DOM parser
my $dom = Mojo::DOM->new(do { local $/; <> });
print $dom->at('input[name=file1]')->attr('value');
print $dom->at('pre')->text;
Output:
foo.c
some code
Upvotes: 6
Reputation: 185254
Using xpath and HTML::TreeBuilder::XPath Perl
module ( very few lines ):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content( <> );
print $tree->findvalue( '//input[@name="file1"]/@value' );
print $tree->findvalue( '//pre/text()' );
USAGE
./script.pl file.html
OUTPUT
foo.c
some code
NOTES
HTML::TreeBuilder
module to do some web-scraping. Now, I can't go back to complexity. HTML::TreeBuilder::XPath
do all the magic with the useful Xpath expressions.new_from_file
method to open a file or a filehandle instead of new_from_content
, see perldoc HTML::TreeBuilder
( HTML::TreeBuilder::XPath
inherit methods from HTML::TreeBuilder
)<>
in this way is allowed here because HTML::TreeBuilder::new_from_content()
specifically allows reading multiple lines in that way. Most constructors will not allow this usage. You should provide a scalar instead or use another method.Upvotes: 5
Reputation: 240030
You don't generally want to use plain HTML::Parser unless you're writing your own parsing module or doing something generally tricky. In this case, HTML::TreeBuilder, which is a subclass of HTML::Parser, is the easiest to use.
Also, note that HTML::Parser has a parse_file
method (and HTML::TreeBuilder makes it even easier with a new_from_file
method, so you don't have to do all of this read_file
business (and besides, there are better ways to do it than the one you picked, including File::Slurp
and the old do { local $/; <$handle> }
trick.
use HTML::TreeBuilder;
my $filename = $ARGV[0];
my $tree = HTML::TreeBuilder->new_from_file($filename);
my $filename = $tree->look_down(
_tag => 'input',
type => 'hidden',
name => 'file1'
)->attr('value');
my $file_contents = $tree->look_down(_tag => 'pre')->as_trimmed_text;
print "filename: ",$filename,"file contents: ",$file_contents;
For information on look_down
, attr
, and as_trimmed_text
, see the HTML::Element docs; HTML::TreeBuilder both is a, and works with, elements.
Upvotes: 4