Reputation: 1
I'm trying to delete all table elements from several HTML files.
The following code runs perfectly on a single file, but when I try to automate the process over a directory it fails with the error:
Can't call method "look_down" on an undefined value
Do you have any solution please?
Here is the code:
use strict;
use warnings;
use Path::Class;
use HTML::TreeBuilder;
opendir( DH, "C:/myfiles" );
my @files = readdir(DH);
closedir(DH);
foreach my $file ( @files ) {
    print("Analyzing file $file\n");
    my $tree = HTML::TreeBuilder->new->parse_file("C:/myfiles/$file");
    foreach my $e ( $tree->look_down( _tag => "table" ) ) {
        $e->delete();
    }
    use HTML::FormatText;
    my $formatter = HTML::FormatText->new;
    my $parsed = $formatter->format($tree);
    print $parsed;
}
Upvotes: 0
Views: 215
Reputation: 126722
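The problem is that you're feeding HTML::TreeBuilder all sorts of junk in addition to the HTML files that you intend. As well as any files in the opened directory, readdir returns the names of all subdirectories, as well as the pseudo-directories "." and "..". parse_file returns an undefined value when it cannot open what it is given, which is presumably what happens for those entries, so $tree is undefined and the call to look_down fails. You should have seen the unexpected names in the output from your print statement: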
print("Analyzing file $file\n");
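To see exactly what readdir hands back, a minimal sketch like the following may help. The helper name describe_entries is hypothetical (it is not part of the original code), and the sketch lists the current directory rather than C:/myfiles so that it runs anywhere.

```perl
use strict;
use warnings;

# Hypothetical helper: label every entry that readdir returns for $dir
# as either a directory or a file.
sub describe_entries {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "Cannot open directory $dir: $!";
    my @entries = readdir $dh;
    closedir $dh;

    my @lines;
    for my $entry ( sort @entries ) {
        # readdir returns bare names, so the directory must be prepended
        my $kind = ( -d "$dir/$entry" ) ? 'directory' : 'file';
        push @lines, "$kind: $entry";
    }
    return @lines;
}

# List the current directory; "." and ".." normally show up as directories
print "$_\n" for describe_entries('.');
```

Anything labelled as a directory here is an entry that should not be passed to parse_file.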
One way to fix this is to check that each value in the loop is a file before processing it. Something like this:
for my $file ( @files ) {
    my $path = "C:/myfiles/$file";
    next unless -f $path;
    print("Analyzing file $file\n");
    my $tree = HTML::TreeBuilder->new->parse_file($path);
    for my $table ( $tree->look_down( _tag => 'table' ) ) {
        $table->delete();
    }
    ...;
}
But it would be much cleaner to use a call to glob. That way you will only get the files that you want, and there is also no need to build the full path to each file. It would look something like this; you would have to adjust the glob pattern if your files don't all end with .html:
for my $path ( glob "C:/myfiles/*.html" ) {
    print("Analyzing file $path\n");
    my $tree = HTML::TreeBuilder->new->parse_file($path);
    for my $table ( $tree->look_down( _tag => 'table' ) ) {
        $table->delete();
    }
    ...;
}
Strictly speaking, a directory name may also match *.html, and if you don't trust your file structure you should also test that each result of glob is a file before processing it. But in normal situations where you know what's in the directory you're processing, that isn't necessary.
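If you do want that extra safety, one possible way to combine the two checks is to filter the glob results with -f. This is only a sketch: the helper name html_files is made up for illustration, and it assumes the directory path contains no whitespace, because glob splits its pattern on spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return the paths in $dir that end in .html
# and are plain files (directories named like *.html are skipped).
# Assumes $dir contains no whitespace, since glob splits on spaces.
sub html_files {
    my ($dir) = @_;
    return grep { -f $_ } glob("$dir/*.html");
}

# Prints nothing if the directory does not exist or has no matches
for my $path ( html_files('C:/myfiles') ) {
    print "Analyzing file $path\n";
}
```

The body of the loop would then be the same parse_file and look_down code as in the glob version above.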
Upvotes: 1