Reputation: 1980
I have a very basic HTML file called example.html (see below):
<html>
<body>
<div class="one">
<div class="research">
<div class="two">
<p>Lorem ipsum...</p>
</div>
<div class="three">
<p>Lorem ipsum...</p>
</div>
<div class="four">
<p>Lorem ipsum...</p>
</div>
</div>
</div>
</body>
</html>
and I'd like to extract only the fragment shown below, without resorting to deleting the first and last 3 lines.
<div class="research">
<p>Lorem ipsum...</p>
<div class="two"></div>
<div class="three"></div>
<div class="four"></div>
</div>
I have tried with awk:
cat example.html | awk '/^<div\ class="research">$/,/^<\/div>$/ { print }'
but something seems to be wrong.
I also tried with the body tag (see below):
cat example.html | awk '/^<body>$/,/^<\/body>$/ { print }'
(result)
<body>
<div class="one">
<div class="research">
<div class="two">
<p>Lorem ipsum...</p>
</div>
<div class="three">
<p>Lorem ipsum...</p>
</div>
<div class="four">
<p>Lorem ipsum...</p>
</div>
</div>
</div>
</body>
And that works correctly.
What am I doing wrong?
Thanks in advance.
Upvotes: 0
Views: 792
Reputation: 247092
You cannot parse HTML with regular expressions. Assuming the HTML is also valid XML, you can use:
xmlstarlet sel -t -c '//div[@class="research"]' -nl example.html
<div class="research">
<div class="two">
<p>Lorem ipsum...</p>
</div>
<div class="three">
<p>Lorem ipsum...</p>
</div>
<div class="four">
<p>Lorem ipsum...</p>
</div>
</div>
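If xmlstarlet is not available, here is a rough awk sketch of why the range pattern in the question falls short and one way around it. An awk range like `/start/,/end/` stops at the first line matching the end pattern, which here would be the `</div>` that closes the "two" div, not the one that closes "research". Counting nesting depth avoids that. This assumes the file looks like the sample (one tag per line, and the opening tag written exactly as `<div class="research">`); it is not a general HTML parser, and the function name `extract_research` is made up for illustration.

```shell
# Sketch only: assumes one tag per line, as in the sample file.
# extract_research is a hypothetical helper name, not a standard tool.
extract_research() {
  awk '
    /<div class="research">/ { inside = 1 }   # start printing at the target div
    inside {
      print
      if ($0 ~ /<div[ >]/) depth++            # an opening div goes one level deeper
      if ($0 ~ /<\/div>/)  depth--            # a closing div comes back up one level
      if (depth == 0) exit                    # the target div itself is now closed
    }
  ' "$1"
}
```

Running `extract_research example.html` on the sample file should print the same block as the xmlstarlet command. The patterns are deliberately not anchored with `^`, so leading indentation in the real file does not prevent a match.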
Upvotes: 6