Viji

Reputation: 15

How to search for and extract particular text in Perl

I have a folder that contains 'n' HTML files. I read each file and want to collect every <img /> tag into an array and then print that array. At the moment nothing is printed. Can you help me? My code is below.

use strict;
use File::Basename;
use File::Path;
use File::Copy;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Excel';

print "Welcome to PERL program\n";

#print "\n\tProcessing...\n";
my $foldername = $ARGV[0];
opendir(DIR,$foldername) or die("Cannot open the input folder for reading\n");
my (@htmlfiles) = grep/\.html?$/i, readdir(DIR);
closedir(DIR);


@htmlfiles = grep!/(?:index|chapdesc|listdesc|listreview|addform|addform_all|pattern)\.html?$/i,@htmlfiles;
# print "HTML file is @htmlfiles";

my %fileimages;
my $search_for = 'img';
my $htmlstr;
for my $files (@htmlfiles)
{
    if(-e "$foldername\\$files")
    {
        open(HTML, "$foldername\\$files") or die("Cannot open the html files '$files' for reading");
        local undef $/;my $htmlstr=<HTML>;
        close(HTML);
        $fileimages{uc($2)}=[$1,$files] while($htmlstr =~/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi);

    }
}

I run it from the command prompt like this:

perl findtext.pl "C:\viji\htmlfiles"

regards, viji

Upvotes: 0

Views: 171

Answers (1)

amon

Reputation: 57640

I would like to point out that parsing HTML with regexes is futile. See the epic answer at https://stackoverflow.com/a/1732454/1521179 for why.

Your regex to extract image tags is quite broken. Instead of using an HTML parser and walking the tree, you search for a string that…

/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi
  • begins with <img
  • after exactly one space, the sequence id=" is found. The contents of that attribute are captured if it is found, else the match fails. The closing " is consumed.
  • after exactly one space, the sequence src="./images/ is found,
  • followed by a character that is not t. (This allows for ", of course).
  • This is followed by any number of any characters that are not slashes or <> characters (This allows for ", again),
  • followed by a slash.
  • now capture this:
    • one or more characters that are not dots
    • followed by the suffix .jpg
  • after which " has to follow immediately.
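
The walk-through above can be written into the pattern itself. The following is only a sketch: it is the regex from the question with the /x modifier added so that each piece can carry a comment. The variable names ($img_re, $sample) and the sample tag are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the question, spread out with /x.
# Under /x plain whitespace is ignored, so each literal space is written "\ ".
my $img_re = qr{
    <img                # literal "<img"
    \ id="([^"]*)"      # exactly one space, then id="..." (first capture)
    \ src="\./images/   # exactly one space, then src="./images/
    [^t]                # one character that is not "t" (a quote is allowed)
    [^/<>]*             # any run without slash or angle brackets
    /                   # a slash
    ([^.]+\.jpg)        # second capture: non-dots followed by ".jpg"
    "                   # the closing quote must follow immediately
}xi;

# Hypothetical input that is written exactly the way the pattern expects.
my $sample = '<img id="fig1" src="./images/ch1/photo.jpg" />';
my ($id, $file) = $sample =~ $img_re;
print "id=$id file=$file\n" if defined $id;
```

The pattern only succeeds when the tag is laid out exactly like this sample.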

False positives

Here is some data that your regex will match, where it shouldn't:

<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"

So what is the image path you get? ./ImAgEs/s" alt="foo/bar.jpg may not be what you wanted.

<!-- <iMg id="" src="./images/./foobar.jpg" -->

Oops, I matched commented-out content. And the path does not contain a subfolder of ./images. The . folder is completely valid in your regex, but denotes the same folder. I could even use .., which would be the folder of your HTML file. Or I could use ./images/./t-rex/image.jpg, which would match a forbidden t-folder.
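
To see these false positives for yourself, here is a small sketch that runs the regex from the question (without the /g flag) against the fragments above. The third sample is my own assumption: it simply wraps the ./images/./t-rex/image.jpg path in a tag of the shape the regex expects. The names $re, @samples and @captured are illustrative.

```perl
use strict;
use warnings;

# Regex copied from the question (case-insensitive, first match only).
my $re = qr/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/i;

my @samples = (
    '<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"',      # runs into the alt attribute
    '<!-- <iMg id="" src="./images/./foobar.jpg" -->',    # inside an HTML comment
    '<img id="" src="./images/./t-rex/image.jpg"',        # "forbidden" t-folder
);

my @captured;
for my $html (@samples) {
    if ($html =~ $re) {
        push @captured, $2;
        print "matched, file capture: $2\n";
    }
    else {
        print "no match\n";
    }
}
```

All three inputs match, even though none of them is an image you would want in %fileimages.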

False negatives

Here is some data you would want, but that you won't get:

<img
  id="you-cant-catch-me"
  src='./images/x/awesome.jpg' />

Why? The attributes are separated by newlines, but you only allow for single spaces between them. Also, you don't allow for single quotes (').

<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />

Why? I now have single spaces, but swapped the attributes. Both of these fragments denote the exact same DOM and should therefore be considered equivalent.
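
The same kind of check works for the false negatives. This sketch feeds the two fragments above to the regex from the question; $re, @samples and @results are illustrative names, not part of the original script.

```perl
use strict;
use warnings;

# Regex copied from the question (case-insensitive, first match only).
my $re = qr/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/i;

my @samples = (
    # attributes on separate lines, src in single quotes
    qq{<img\n  id="you-cant-catch-me"\n  src='./images/x/awesome.jpg' />},
    # single spaces, but src comes before id
    '<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />',
);

# 1 if the regex matched the fragment, 0 otherwise.
my @results = map { $_ =~ $re ? 1 : 0 } @samples;
print "sample $_: ", ($results[$_] ? "match" : "no match"), "\n" for 0 .. $#samples;
```

Neither fragment matches, although both are perfectly ordinary image tags.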

Conclusion

Go to http://www.cpan.org/ and search for HTML and Tree. Use a module to parse your HTML, walk the tree, and extract all matching nodes.
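
As a rough sketch of what that could look like, here is one way to collect the img elements with HTML::TreeBuilder. This assumes the HTML-Tree distribution from CPAN is installed (it is not a core module), and the inline HTML, @images and the loop are made up for illustration; in your script you would feed in the contents of each file instead.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # from the HTML-Tree distribution on CPAN, not core

# Illustrative input: one commented-out tag, one multi-line tag with
# single quotes, one tag with src before id.
my $html = <<'HTML';
<html><body>
<!-- <img id="hidden" src="./images/x/commented.jpg"> -->
<img
  id="you-cant-catch-me"
  src='./images/x/awesome.jpg' />
<img src="./images/y/other.jpg" id="second" />
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @images;
# look_down in list context returns every element that meets the criteria.
for my $img ($tree->look_down(_tag => 'img')) {
    my $id  = $img->attr('id');
    my $src = $img->attr('src');
    next unless defined $src;          # skip img tags without a src
    push @images, [ $id, $src ];
    print "id=", (defined $id ? $id : ''), " src=$src\n";
}

$tree->delete;   # free the tree explicitly
```

Attribute order, quoting style and line breaks no longer matter here, and the commented-out tag is not picked up because comments are not stored in the tree by default. Any filtering on the path (for example, only .jpg files below ./images) can then be done on the plain $src string.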

Also, add a print statement somewhere. I found a

 use Data::Dumper;
 print Dumper \%fileimages;

quite enlightening for debug purposes.
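
For orientation, here is a small sketch of what such a dump tells you. The hash contents are invented: they only mimic the shape your loop builds (upper-cased file name as key, [id, html file] as value). $Data::Dumper::Sortkeys is a real Data::Dumper setting that keeps the key order stable.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order in the output

# Hypothetical data in the same shape as %fileimages in the question.
my %fileimages = (
    'PHOTO.JPG' => [ 'fig1', 'chapter1.html' ],
);
my $filled = Dumper(\%fileimages);
print $filled;

# An empty hash is what you get when the regex never matched.
my %nothing_found;
my $empty = Dumper(\%nothing_found);
print $empty;
```

If the dump of %fileimages looks like the second one, the problem is the matching, not the printing.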

Upvotes: 4
