Reputation: 833
I have an HTML file from which I need to extract only a specific part. The biggest challenge is that the file has no line breaks, so my grep expression isn't working well.
Here is my HTML file:
<a href="/link1" param1="data1_1" param2="1_2"><p>Test1</p></a><a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
Note that I have two anchors (<a>) on this line.
I want to get the second anchor and I was trying to get it using:
cat example.html | grep -o "<a.*Test2</p></a>"
Unfortunately, this command returns the whole line, but I want only:
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
I don't know how to do this with grep or sed. I'd really appreciate any help.
Upvotes: 1
Views: 241
Reputation: 47099
Using Perl:
$ perl -pe '@a = split(m~(?<=</a>)~, $_);$_ = $a[1]' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
Breakdown:
perl -pe ' '          # Read the input line by line into $_
                      # and print $_ at the end
m~(?<=</a>)~          # Match the position after
                      # each </a> tag
@a = split( , $_);    # Split into array @a
$_ = $a[1]            # Keep the second item (index 1)
Upvotes: 0
Reputation: 203209
With GNU awk for multi-char RS, if it's the second record you want:
$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"} NR==2' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
or if it's the record labeled "Test2":
$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"} /<p>Test2<\/p>/' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
or:
$ awk 'BEGIN{RS="</a>"; ORS=RS"\n"; FS="</?p>"} $2=="Test2"' file
<a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>
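If GNU awk is not at hand, a grep-only variant may be enough. The attempt in the question returns the whole line because .* is greedy and the match starts at the first <a. The sketch below replaces .* with [^>]* so the match cannot run past the end of the opening tag. It assumes that no attribute value contains a > character and that your grep supports -o (GNU and BSD grep do); the html and result variable names are illustrative only.

```shell
# Sample input copied from the question: two anchors on a single line.
html='<a href="/link1" param1="data1_1" param2="1_2"><p>Test1</p></a><a href="/link2" param1="data1_1" param2="1_2"><p>Test2</p></a>'

# [^>]* stays inside a single opening <a ...> tag, so the first anchor
# cannot match (its label is Test1) and only the second one is printed.
result=$(printf '%s\n' "$html" | grep -o '<a [^>]*><p>Test2</p></a>')

printf '%s\n' "$result"
```

This is still pattern matching on markup, so it is only reasonable for input as regular as the sample line.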
Upvotes: 1