Reputation: 15
I have one folder that contains 'n' HTML files. I read each file and take one line from it, i.e. I collect the <img /> tags in one array and then print the array.
But the array is not printed. Can you help me? My code is here.
use strict;
use File::Basename;
use File::Path;
use File::Copy;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Excel';
print "Welcome to PERL program\n";
#print "\n\tProcessing...\n";
my $foldername = $ARGV[0];
opendir(DIR,$foldername) or die("Cannot open the input folder for reading\n");
my (@htmlfiles) = grep/\.html?$/i, readdir(DIR);
closedir(DIR);
@htmlfiles = grep!/(?:index|chapdesc|listdesc|listreview|addform|addform_all|pattern)\.html?$/i,@htmlfiles;
# print "HTML file is @htmlfiles";
my %fileimages;
my $search_for = 'img';
my $htmlstr;
for my $files (@htmlfiles)
{
    if(-e "$foldername\\$files")
    {
        open(HTML, "$foldername\\$files") or die("Cannot open the html files '$files' for reading");
        local undef $/;my $htmlstr=<HTML>;
        close(HTML);
        $fileimages{uc($2)}=[$1,$files] while($htmlstr =~/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi);
    }
}
I run it from the command prompt like this:
perl findtext.pl "C:\viji\htmlfiles"
regards, viji
Upvotes: 0
Views: 171
Reputation: 57640
I would like to point out that parsing HTML with regexes is futile. See the epic https://stackoverflow.com/a/1732454/1521179 for the answer.
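Your regex to extract image tags is quite broken. Instead of using an HTML parser and walking the tree, you search for a string that…
/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi
- begins with <img, followed by a single space.
- id=" is found. The contents of that attribute are captured if it is found, else the match fails. The closing " is consumed.
- A single space, then src="./images/ is found,
- followed by one character that is not t. (This allows for ", of course).
- followed by any number of characters that are not slashes or <> characters (This allows for ", again),
- then a slash,
- then one or more non-dot characters followed by .jpg, which are captured.
- A " has to follow immediately.
Here is some data that your regex will match, where it shouldn't: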
<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"
So what is the image path you get? ./ImAgEs/s" alt="foo/bar.jpg may not be what you wanted.
<!-- <iMg id="" src="./images/./foobar.jpg" -->
Oops, I matched commented content. And the path does not contain a subfolder of ./images. The . folder is completely valid in your regex, but denotes the same folder. I could even use .., which would be the folder of your HTML file. Or I could use ./images/./t-rex/image.jpg, which would match a forbidden t-folder.
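To see these false positives for yourself, here is a minimal sketch that runs the pattern from the question against the fragments discussed above. The helper name captured_name and the third sample tag (built around the ./images/./t-rex/image.jpg path) are made up for illustration; only the regex itself comes from the question.

```perl
use strict;
use warnings;

# The pattern from the question, unchanged (without /g, one match per string).
my $re = qr/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/i;

# Hypothetical helper: returns the captured file name, or undef if no match.
sub captured_name {
    my ($html) = @_;
    return $html =~ $re ? $2 : undef;
}

my @samples = (
    '<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"',     # src and alt run together
    '<!-- <iMg id="" src="./images/./foobar.jpg" -->',   # commented-out markup
    '<img id="x" src="./images/./t-rex/image.jpg" />',   # illustrative "t" folder
);

for my $html (@samples) {
    my $name = captured_name($html);
    print defined $name ? "matched, captured: $name\n" : "no match\n";
}
```

All three fragments should be reported as matches, even though none of them is an image you would want in the result.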
Here is some data you would want, but that you won't get:
<img
id="you-cant-catch-me"
src='./images/x/awesome.jpg' />
Why? It contains newlines, but you only allow for single spaces between the attributes. Also, you don't allow for single quotes (').
<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />
Why? I now have single spaces, but swapped the attributes. But both these fragments denote the exact same DOM and therefore should be considered equivalent.
Go to http://www.cpan.org/ and search for HTML and Tree. Use a module to parse your HTML, walk the tree, and extract all matching nodes.
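As one possible way to do this, here is a rough sketch using HTML::TreeBuilder (from the HTML-Tree distribution on CPAN, which has to be installed separately). The sub name extract_images and the sample markup are my own assumptions, and you would still need to add your own filtering on the src path.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # not a core module; install HTML-Tree from CPAN

# Hypothetical helper: returns a list of [id, src] pairs,
# one for every <img> element that has a src attribute.
sub extract_images {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @images;
    for my $img ($tree->look_down(_tag => 'img')) {
        my $src = $img->attr('src');
        next unless defined $src;
        push @images, [ $img->attr('id'), $src ];
    }
    $tree->delete;   # release the parse tree explicitly
    return @images;
}

# Sample markup: newlines, single quotes, swapped attributes, and a comment.
my $html = <<'HTML';
<img
id="you-cant-catch-me"
src='./images/x/awesome.jpg' />
<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />
<!-- <img id="hidden" src="./images/x/commented.jpg" /> -->
HTML

for my $image (extract_images($html)) {
    my ($id, $src) = @$image;
    printf "id=%s src=%s\n", defined $id ? $id : '(none)', $src;
}
```

Because the parser works on elements and attributes rather than on raw text, attribute order, quoting style and line breaks no longer matter, and with the default settings the commented-out tag should not show up as an element.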
Also, add a print statement somewhere. I found a
use Data::Dumper;
print Dumper \%fileimages;
quite enlightening for debug purposes.
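For illustration, here is a small self-contained sketch of what such output could look like. The hash contents are invented, but they are shaped like the question's %fileimages (upper-cased file name mapped to the id and the HTML file), and report_lines is a hypothetical helper for a plainer report.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easier to compare

# Invented sample data, shaped like the question's %fileimages:
# upper-cased file name => [ id attribute, HTML file it was found in ]
my %fileimages = (
    'FIG1.JPG' => [ 'fig1', 'chapter1.html' ],
    'FIG2.JPG' => [ 'fig2', 'chapter2.html' ],
);

# Hypothetical helper: one human-readable line per image, sorted by name.
sub report_lines {
    my ($images) = @_;
    return map {
        my ($id, $file) = @{ $images->{$_} };
        "$_ (id=$id) found in $file";
    } sort keys %$images;
}

print Dumper(\%fileimages);                      # full structure, for debugging
print "$_\n" for report_lines(\%fileimages);     # compact report
```

If the dump shows an empty hash, the regex never matched; if it shows entries, the only thing missing from the original script was the print.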
Upvotes: 4