Reputation: 462
I am using this command line to get a particular line from an html file which contains various other tags, links etc.:
cat index.html | grep -m1 -oE '<a href="(.*?)" rel="sample"[\S\s]*.*</dd>'
It outputs the line which I want:
<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>
But I want to capture only something/two (the path of the last URL), considering that the line can sometimes contain only one URL, e.g.:
<a href="http://example.com/something/one/" rel="sample" >Foo</a></dd>
in which case I would want to get only something/one, as in this case it is the last one.
How can I do that?
Upvotes: 1
Views: 1383
Reputation: 440679
On Linux, GNU grep's -P option enables a concise solution:
$ grep -oP '.*<a href="http://.+?/\K[^"]+(?=/"\s*rel="sample".*</dd>$)' index.html
something/two
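To check that the command also covers the single-link case from the question, here is a small sketch that runs it against both sample lines. The temporary directory and the file names two.html and one.html are assumptions made for illustration, and GNU grep built with PCRE support is required.

```shell
# Sketch: requires GNU grep with -P (PCRE) support.
# The temp directory and file names are hypothetical, for illustration only.
dir=$(mktemp -d)

# Line with two links, as in the question.
printf '%s\n' '<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>' > "$dir/two.html"

# Line with a single link.
printf '%s\n' '<a href="http://example.com/something/one/" rel="sample" >Foo</a></dd>' > "$dir/one.html"

# Same regex as in the command above.
re='.*<a href="http://.+?/\K[^"]+(?=/"\s*rel="sample".*</dd>$)'

last_of_two=$(grep -oP "$re" "$dir/two.html")
only_one=$(grep -oP "$re" "$dir/one.html")

echo "$last_of_two"   # something/two
echo "$only_one"      # something/one

rm -r "$dir"
```

In both cases the leading greedy .* pushes the match to the last <a href on the line, which is why the single-link line needs no special handling.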
-o only outputs the matching part(s) of each line that matches.
-P activates support for PCREs (Perl-compatible regular expressions), which support advanced regex constructs such as non-greedy matching (*?), dropping everything matched so far (\K), and look-ahead assertions ((?=...)).
\K and (?=...) allow constraining the matching part of the regex to the subexpression of interest.
No grep implementation supports extracting capture-group values, but the above, thanks to the features enabled by -P, is an emulation of extracting a single capture-group value.

As for what you tried:
-m1 limits the number of matching lines to 1, but with -o also present, multiple matches on that 1 line are still all printed.
While you can use (...) for precedence, that doesn't constitute a capture group in grep, because there's no support for extracting capture-group values in grep.
Even with -E for extended regex support, advanced constructs such as non-greedy matching (.*?) are not supported.
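The first and last of these points can be seen in a short experiment. This is a sketch assuming GNU grep (with -P available); the sample string is made up for illustration, and other grep implementations may treat .*? under -E differently.

```shell
# Sketch, assuming GNU grep; the sample input is made up for illustration.
line='<b>one</b> <b>two</b>'

# -m1 stops after the first matching LINE, but -o still prints every
# match found on that line, one per output line.
tags=$(printf '%s\n%s\n' "$line" "$line" | grep -m1 -o '<b>[a-z]*</b>')
echo "$tags"

# Under -E, the ? in .*? does not make the match non-greedy,
# so the match runs to the last </b> on the line.
ere=$(printf '%s\n' "$line" | grep -oE '<b>.*?</b>')
echo "$ere"

# Under -P, .*? is non-greedy, so each match stops at the nearest </b>.
pcre=$(printf '%s\n' "$line" | grep -oP '<b>.*?</b>')
echo "$pcre"
```

With GNU grep, the -E run prints the whole line as a single match, while the -P run prints the two tags separately.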
Upvotes: 1
Reputation: 4722
If you can use perl, then capturing within a regex makes this a lot easier.
perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'
The regex is basically the same one that would also work with grep. I've used m() instead of // to avoid escaping the / inside the regex.
The initial .* greedily matches everything at the beginning of the line, so if there are multiple links on a line, it consumes all but the last. This works with grep too, but it causes grep -o to output the beginning of the line, since that is now part of the match.
This doesn't matter with the capturing parentheses, as only the part inside the (.*?) is captured and printed.
It is used the same way as grep:
cat index.html | perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'
or
perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";' index.html
Upvotes: 1
Reputation: 242443
Just append the following to your pipeline:
| grep -o 'href="[^"]*' | tail -n1
The first part extracts only the hrefs; the second part keeps only the last line.
If you want to extract only the path, you can use cut with the delimiter set to / and extract everything starting from the fourth field:
| grep -o 'href="[^"]*' | tail -n1 | cut -f4- -d/
because
href="http://example.com/something/two/
1           23           4         5
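Put together, the pipeline might look like the sketch below. The first grep stage is a simplified stand-in for the command in the question (an assumption for illustration), and the final sed step is an optional extra, since cut -f4- keeps the trailing slash of the URL.

```shell
# Sketch: sample line copied from the question.
line='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'

# Stand-in first stage: pick the first line containing rel="sample",
# then keep only the hrefs, take the last one, and cut from field 4 on.
path=$(printf '%s\n' "$line" \
  | grep -m1 'rel="sample"' \
  | grep -o 'href="[^"]*' \
  | tail -n1 \
  | cut -f4- -d/)
echo "$path"      # something/two/

# Optional extra step: drop the trailing slash.
trimmed=$(printf '%s\n' "$path" | sed 's|/$||')
echo "$trimmed"   # something/two
```

The same pipeline returns something/one for a line with a single link, because tail -n1 simply keeps the only href there is.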
Upvotes: 2