george

Reputation: 462

How to match the last occurrence of a pattern on a single line string

I am using this command line to get a particular line from an html file which contains various other tags, links etc.:

cat index.html | grep -m1 -oE '<a href="(.*?)" rel="sample"[\S\s]*.*</dd>'

It outputs the line which I want:

<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>

But I want to capture only something/two (the path of the last URL) considering that:

How can I do that?

Upvotes: 1

Views: 1383

Answers (3)

mklement0

Reputation: 440679

On Linux, GNU grep's -P option enables a concise solution:

$ grep -oP '.*<a href="http://.+?/\K[^"]+(?=/"\s*rel="sample".*</dd>$)' index.html
something/two

-o only outputs the matching part(s) of each line that matches.

-P activates support for PCREs (Perl-compatible regular expressions), which support advanced regex constructs such as non-greedy matching (*?), dropping everything matched so far (\K), and look-ahead assertions ((?=...)).

  • The combination of \K and (?=...) allows constraining the matching part of the regex to the subexpression of interest.
    Note that no grep implementation can output capture-group values directly, but the above, thanks to the features enabled by -P, emulates extracting a single capture-group value.
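To try the \K and (?=...) combination without an index.html at hand, here is a minimal sketch that feeds the sample line from the question to the same regex. It assumes GNU grep built with -P support; last_path is a hypothetical helper name used only for illustration.

```shell
# Sample line copied from the question; in practice the input would be index.html.
line='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'

# last_path is a hypothetical helper name, not part of grep.
last_path() {
  # The leading .* is greedy, so the match settles on the last <a href=...> of the line.
  # \K drops everything matched so far; (?=...) checks the tail without printing it.
  grep -oP '.*<a href="http://.+?/\K[^"]+(?=/"\s*rel="sample".*</dd>$)'
}

printf '%s\n' "$line" | last_path
```

Only the text between \K and the look-ahead is reported by -o, which is why the host name and the trailing slash are left out.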

As for what you tried:

  • -m1 limits the number of matching lines to 1, but with -o also present, multiple matches on that 1 line are still all printed.

    • Additionally, while you can use (...) for grouping and precedence, grep offers no way to extract the value that such a group captured.
  • Even with -E for extended regex support, advanced constructs such as non-greedy matching (.*?) are not supported.
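The interaction between -m1 and -o can be checked with a small sketch on the sample line from the question. The simplified pattern href="[^"]*" is an assumption chosen for illustration, not the pattern from the question.

```shell
# Sample line copied from the question.
line='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'

# -m1 stops after the first matching line, yet -o still prints every match on that line.
all_hrefs=$(printf '%s\n' "$line" | grep -m1 -oE 'href="[^"]*"')

printf '%s\n' "$all_hrefs"
```

Both href attributes are printed, one per line, even though -m1 is in effect.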

Upvotes: 1

TrentP

Reputation: 4722

If you can use perl, then capturing within a regex makes this a lot easier.

 perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'

The regex is basically the same one that would work with grep. I've used m() instead of // to avoid escaping the / inside the regex.

The initial .* greedily matches as much of the beginning of the line as it can. If you have multiple links on a line, it consumes all but the last. This works with grep too, but it causes grep -o to output the beginning of the line, since that is now part of the match.

This doesn't matter with the capturing parentheses, as only the part inside the (.*?) is captured and printed.

It would be used the same way as grep.

cat index.html | perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'

or

perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";' index.html

Upvotes: 1

choroba

Reputation: 242443

Just add

| grep -o 'href="[^"]*' | tail -n1

The first part extracts only the hrefs; the second part keeps only the last one.

If you want to extract only the path, you can use cut with delimiter set to / and extract everything starting from the fourth column:

| grep -o 'href="[^"]*' | tail -n1 | cut -f4- -d/

because

href="http://example.com/something/two/
1          23            4         5
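As a quick check of the field counting, the pipeline can be run on the sample line from the question. The final sed step that strips the trailing slash is an addition for illustration, not part of the answer above.

```shell
# Sample line copied from the question.
line='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'

# Keep the last href, then everything from the fourth /-separated field onward.
path=$(printf '%s\n' "$line" | grep -o 'href="[^"]*' | tail -n1 | cut -f4- -d/)

# The trailing slash of the URL survives cut; this extra step removes it.
trimmed=$(printf '%s\n' "$path" | sed 's:/$::')

printf '%s\n%s\n' "$path" "$trimmed"
```

The cut output still ends in /, since the URL itself does; the sed step is one way to drop it if only something/two is wanted.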

Upvotes: 2
