capser

Reputation: 2635

Parsing HTML with regex - using capture and $1 parameter

I just want to say up front that I understand you can't parse HTML with regexes. I get that.

I am just getting a few URLs from a webpage.

The output is a little strange: there is a newline after the closing anchor tag.

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId
 =1023"><B>29327</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId
=1023"><B>29450</B></A>

So I wrote this little script to make it neater.

#!/usr/bin/perl
use strict;
use warnings ;
my $list = "/tmp/rawurl_list";
open( my $filehandle ,"<", "$list") or die $!;
while (<$filehandle>) {
    s/\n//g;
    s/\<\/A\>/\n/g;
    print $_ ;
        if ($_ =~ /^<A HREF="(.*)"/) {
           print $1;
        }
}

And this is what I get:

<A HREF="tmtrack.dll?  IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023"><B>26165</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023"><B>28722</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023"><B>29327</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023"><B>29450</B>

But I am having trouble stripping off the <A HREF tag.

The HREF regex must be OK, because it works in the one-liner.

bash-3.00$ /casper/strip | perl -nle 'print /^<A\sHREF="(.*)"/'
tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023

I must be doing something wrong in the script. I need to learn why this does not strip off the HTML tags. I am posting this because I run into this error all the time and just end up using the Perl extract from the command line instead of within a script. I am not learning past this.

Upvotes: 2

Views: 178

Answers (3)

Casimir et Hippolyte

Reputation: 89574

In your code, replace s/\<\/A\>/\n/g; with s/\<\/A\>\K/\n/g; or s/(?<=<\/A>)/\n/g;

\K keeps everything matched before it out of the final match, so only what follows \K (here, nothing) is replaced and your closing tag is not removed.
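To make the difference concrete, here is a minimal sketch that runs the three substitutions on the same string. The helper break_after_anchor and the sample string are made up for illustration, and \K assumes Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply one of the three substitutions to a copy of $text.
sub break_after_anchor {
    my ($text, $mode) = @_;
    if ($mode eq 'plain') {
        $text =~ s/<\/A>/\n/g;        # the tag itself is replaced by the newline
    }
    elsif ($mode eq 'keep') {
        $text =~ s/<\/A>\K/\n/g;      # \K: the tag stays, the newline is added after it
    }
    elsif ($mode eq 'lookbehind') {
        $text =~ s/(?<=<\/A>)/\n/g;   # zero-width match at the position right after the tag
    }
    return $text;
}

# Made-up sample: two links on a single line.
my $sample = '<A HREF="a"><B>1</B></A><A HREF="b"><B>2</B></A>';

for my $mode (qw(plain keep lookbehind)) {
    print "--- $mode ---\n", break_after_anchor($sample, $mode);
}
```

The 'keep' and 'lookbehind' variants should give the same result, with each </A> left in place and followed by a newline, while 'plain' drops the closing tags.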

Note: As far as I know, you don't need to escape < and >.

Note 2: the href part of your code works only because the dot doesn't match newlines by default: .* matches the rest of the line, then the regex engine backtracks to find the last double quote. A better way is to use a lazy quantifier instead: <A\s+HREF="(.*?)". An even better way is to use \S* instead: <A\s+HREF="(\S*)" (much less backtracking to reach the double quote, since a URL doesn't contain whitespace). Or use <A\s+HREF="([^"]+)", which cannot run past a double quote.
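The difference between these patterns only shows up when more than one double quote follows the HREF on the same line. A small sketch, assuming a made-up line with two links on it (the sample data and the helper first_href are illustrative, not from the question):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: two links on one line, so several double quotes follow the first HREF.
my $line = '<A HREF="one.dll?a=1"><B>1</B></A><A HREF="two.dll?b=2"><B>2</B></A>';

# Hypothetical helper: return the first capture group of $pattern in $text, or undef.
sub first_href {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $1 : undef;
}

my @patterns = (
    [ 'greedy .*',   qr/<A\s+HREF="(.*)"/    ],
    [ 'lazy .*?',    qr/<A\s+HREF="(.*?)"/   ],
    [ 'non-space',   qr/<A\s+HREF="(\S*)"/   ],
    [ 'negated set', qr/<A\s+HREF="([^"]+)"/ ],
);

for my $entry (@patterns) {
    my ($name, $pattern) = @$entry;
    my $href = first_href($pattern, $line);
    printf "%-12s %s\n", $name, defined $href ? $href : '(no match)';
}
```

On this input the greedy pattern captures everything up to the last double quote on the line, while the other three stop at the end of the first URL.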

Upvotes: 1

Birei

Reputation: 36272

One solution: check whether a line begins with <A, and if it does, append the next line to it and then run the regular expression to extract the first captured group:

#!/usr/bin/env perl

use warnings;
use strict;

my $list = "/tmp/rawurl_list";
open( my $filehandle ,"<", "$list") or die $!; 
while (<$filehandle>) {
    chomp;
    if ( m/^<A/ ) { 
        $_ .= <$filehandle>;
        if ($_ =~ /^<A HREF="(.*)"/) {
           print "$1\n";
        }       
    }   
}

It yields:

tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId =1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023

Upvotes: 1

ysth

Reputation: 98398

Your script is only reading one line at a time; the ending " is only encountered on the following iteration of the while loop. If you want to read one link at a time, try adding:

local $/ = '</A>';

before the while loop. (See $/ in perlvar.)
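Put together, that could look roughly like the sketch below. It is only one way to wire it up: the sub extract_hrefs is a made-up name, the [^"]+ pattern is swapped in for the original .*, and an in-memory filehandle with two records copied from the question stands in for /tmp/rawurl_list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the HREF value of every link readable from $fh.
sub extract_hrefs {
    my ($fh) = @_;
    local $/ = '</A>';             # a "line" now ends at the closing anchor tag
    my @hrefs;
    while (my $record = <$fh>) {
        $record =~ tr/\r\n//d;     # drop the line breaks inside the record
        push @hrefs, $1 if $record =~ /<A\s+HREF="([^"]+)"/;
    }
    return @hrefs;
}

# Two records copied from the question, including the break inside the HREF.
my $raw = <<'END';
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>

<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>
END

# In-memory filehandle; open the real file here instead.
open(my $fh, '<', \$raw) or die $!;
print "$_\n" for extract_hrefs($fh);
close $fh;
```

Because each read now returns a whole link, the opening <A HREF=" and its closing double quote are in the same record, and the capture can succeed inside the script as well.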

Upvotes: 4
