Reputation: 51

Non greedy regex matching in sed/perl

I was doing

sed /http.*.torrent/s/.*(http.*.torrent).*/\1/;/http.*.torrent/p 1.html

to extract links. However, since sed lacks a non-greedy quantifier (which is needed because 'torrent' appears again further along the line), I tried to convert it to perl, though I need help with the perl. (Or if you know how to do it with sed, say so.)

perl -ne s/.*(http.*?.torrent).*/\1/ 1.html

Now I need to add this part, after converting it from sed:

/http.*.torrent/p

This was a part of

sed /http.*.torrent/s/.*(http.*.torrent).*/\1/;/http.*.torrent/p 1.html

but this didn't work either; sed started but didn't quit, and as I pressed keys they were echoed and nothing else happened.

Upvotes: 3

Answers (2)

DavidO

Reputation: 13942

I recommend letting a well-proven module such as HTML::LinkExtor do the heavy lifting for you, and using a regexp simply to validate the links that it finds. The example below shows how easy it can be.

use Modern::Perl;
use HTML::LinkExtor;
use Data::Dumper;

my @links;


# A callback for LinkExtor. Disqualifies non-conforming links, and pushes
# into @links any conforming links.

sub callback {
    my ( $tag, %attr ) = @_;
    return if $tag ne 'a';
    # Skip anchors without an href (avoids an "uninitialized value" warning).
    return unless defined $attr{href} && $attr{href} =~ m{https?://[^/]*torrent}i;
    push @links, \%attr;
}


# The work is done here: Read the html file, parse it, and move on.
local $/;    # slurp mode: read all of DATA in one go
my $html = <DATA>;
my $p = HTML::LinkExtor->new(\&callback);
$p->parse( $html );

print Dumper \@links;

__DATA__
<a href="https://toPB.torrent" title="Download this torrent">The goal</a>
<a href="http://this.is.my.torrent.com" title="testlink">Testing2</a> <a href="http://another.torrent.org" title="bwahaha">Two links on one line</a>
<a href="https://toPBJ.torrent.biz" title="Last test">Final Test</a>
A line of nothingness...
That's all folks.

HTML::LinkExtor lets you set up a callback function. The module itself parses your HTML document to find any links. You are looking for the 'a' links (as opposed to 'img', etc.). So in your callback function you just exit as soon as possible unless you have an 'a' link. Then test that 'a' link to see if there's a 'torrent' name in it, in an appropriate position. If that particular regexp isn't what you need, you'll have to be more specific, but I think it's what you were after. As links are found they're pushed onto a data structure. At the end of my test script I print the structure so you can see what you have.

The __DATA__ section contains some sample HTML snippets, along with junk text to verify that it's only finding links.

Using a well tested module to parse your HTML is so much more durable than constructing fragile regular expressions to do the whole job. Many well-made parsing solutions include regular expressions under the hood, but only to do little bits and pieces of the work here and there. When you start relying on a regexp to do the parsing (as opposed to the identifying of small building blocks), you run out of gas quickly.

Have fun.

Upvotes: 4

unpythonic

Reputation: 4070

sed doesn't have non-greedy matching, so your best bet is just to use perl:

perl -ne '/.*?(http.*?\.torrent)/ && print "$1\n"' 1.html

The -n argument tells perl to read each line of input (from 1.html in this case, or from stdin if no files are given on the command line) and run some code against each line; the -e argument supplies that code on the command line.

The first part of the expression is a match against the pattern you were looking for, with the parentheses capturing the interesting part into $1. If it matches, it evaluates to true, so the print then runs (giving you your match along with a newline).

Upvotes: 3
