Reputation: 2635
I just want to say that I understand that you can't parse HTML with regexes. I get that. You cannot parse HTML with regex.
I am just getting a few URLs from a webpage.
The output is a little strange: there is a newline after the closing anchor tag.
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId
=1023"><B>26165</B></A>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId
=1023"><B>28722</B></A>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId
=1023"><B>29327</B></A>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId
=1023"><B>29450</B></A>
So I wrote this little script to make it neater.
#!/usr/bin/perl
use strict;
use warnings;
my $list = "/tmp/rawurl_list";
open( my $filehandle, "<", "$list" ) or die $!;
while (<$filehandle>) {
    s/\n//g;
    s/\<\/A\>/\n/g;
    print $_;
    if ($_ =~ /^<A HREF="(.*)"/) {
        print $1;
    }
}
And this is what I get:
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023"><B>26165</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023"><B>28722</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023"><B>29327</B>
<A HREF="tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023"><B>29450</B>
But I am having trouble stripping off the <A HREF tag.
The HREF regex must be OK: it works in the one-liner.
bash-3.00$ /casper/strip | perl -nle 'print /^<A\sHREF="(.*)"/'
tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023
I must be doing something wrong with the script. I need to learn why this does not strip off the HTML tags. I am posting this because I run into this error all the time and just end up using the Perl extract from the command line instead of within a script. I am not learning past this.
Upvotes: 2
Views: 178
Reputation: 89574
In your code, replace s/\<\/A\>/\n/g; with s/\<\/A\>\K/\n/g; or s/(?<=<\/A>)/\n/g;
Since \K keeps everything matched before it out of the replaced text, your closing tag is not removed.
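To make the difference concrete, here is a minimal sketch comparing the two substitutions. The sample string is made up for illustration, and \K assumes Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two anchors already joined onto one line.
my $sample = '<A HREF="a.html"><B>1</B></A><A HREF="b.html"><B>2</B></A>';

# Plain substitution: the closing tag itself is replaced by the newline.
(my $plain = $sample) =~ s/<\/A>/\n/g;

# With \K (Perl 5.10+): the text matched before \K is left in place,
# so the newline is inserted after the closing tag.
(my $kept = $sample) =~ s/<\/A>\K/\n/g;

print $plain;   # lines end in </B>
print $kept;    # lines end in </A>
```

The lookbehind form s/(?<=<\/A>)/\n/g gives the same result as the \K form here.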
Note: As far as I know, you don't need to escape < and >.
Note 2: the HREF part of your code works only because the dot doesn't match newlines by default (.* matches the whole line, then the regex engine backtracks to find the double quote). A better way is to use a lazy quantifier instead: <A\s+HREF="(.*?)"
Better still is \S*: <A\s+HREF="(\S*)"
(a URL contains no whitespace, so \S* stops at the first space and needs little backtracking to find the double quote). Or <A\s+HREF="([^"]+)"
which never matches past a double quote.
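As a rough illustration of how the four patterns differ, the sketch below runs each one against a single line. The sample line and its TITLE attribute are hypothetical; they were added only so that a second double quote appears later on the line.

```perl
use strict;
use warnings;

# Hypothetical line: a second attribute puts another quote after the URL.
my $line = '<A HREF="tmtrack.dll?RecordId=20193" TITLE="issue"><B>26165</B></A>';

my %pattern = (
    greedy  => qr/^<A\s+HREF="(.*)"/,      # runs to the LAST quote on the line
    lazy    => qr/^<A\s+HREF="(.*?)"/,     # stops at the first quote
    nospace => qr/^<A\s+HREF="(\S*)"/,     # stops at the first whitespace
    noquote => qr/^<A\s+HREF="([^"]+)"/,   # can never cross a quote
);

my %got;
for my $name (sort keys %pattern) {
    ($got{$name}) = $line =~ $pattern{$name};
    printf "%-8s %s\n", $name, $got{$name};
}
```

On this line only the greedy pattern captures too much (it swallows the TITLE attribute as well). The other three all return the bare URL.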
Upvotes: 1
Reputation: 36272
One solution: check whether a line begins with <A, append the next line to it, and then apply the regular expression to extract the first capture group:
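#!/usr/bin/env perl
use warnings;
use strict;
my $list = "/tmp/rawurl_list";
open( my $filehandle, "<", "$list" ) or die $!;
while (<$filehandle>) {
    # Drop the trailing newline so the two halves join cleanly.
    chomp;
    if ( m/^<A/ ) {
        # The anchor is wrapped: append the continuation line.
        $_ .= <$filehandle>;
        # The closing quote is now on the same line as the opening one.
        if ($_ =~ /^<A HREF="(.*)"/) {
            print "$1\n";
        }
    }
}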
It yields:
tmtrack.dll?IssuePage&SolutionId=8&RecordId=20193&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=21811&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22163&Template=view&TableId=1023
tmtrack.dll?IssuePage&SolutionId=8&RecordId=22238&Template=view&TableId=1023
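The original loop fails because the pattern is tested against each physical line, and the closing double quote sits on the following line. A different route, offered only as a sketch and not as part of the code above, is to undo the wrapping for the whole text first and then collect every match at once. The helper name extract_hrefs and the shortened sample data are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return every HREF value in text whose lines may be wrapped.
sub extract_hrefs {
    my ($text) = @_;
    $text =~ s/\r?\n//g;                        # undo the line wrapping first
    return $text =~ /<A\s+HREF="([^"]+)"/gi;    # list context: all captures
}

# Stand-in for the contents of /tmp/rawurl_list (shortened sample data).
my $raw = <<'END';
<A HREF="tmtrack.dll?IssuePage&RecordId=20193&TableId
=1023"><B>26165</B></A>
<A HREF="tmtrack.dll?IssuePage&RecordId=21811&TableId
=1023"><B>28722</B></A>
END

my @urls = extract_hrefs($raw);
print "$_\n" for @urls;
```

To run it against the real file, read the whole file into one string, for example with local $/ set to undef, and pass that string to the helper.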
Upvotes: 1