Egel

Reputation: 1980

Extract text between two strings in simple example.html file

I have a very basic HTML file called example.html (see below):

<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>
</html>

and I'd like to extract only a fragment like the one below, but without simply removing the first and last three lines:

<div class="research">
    <p>Lorem ipsum...</p>
    <div class="two"></div>
    <div class="three"></div>
    <div class="four"></div>
</div>

I have tried with awk:

cat example.html | awk '/^<div\ class="research">$/,/^<\/div>$/ { print }'

but something seems to be wrong.

I also tried it with the body tag (see below):

cat example.html | awk '/^<body>$/,/^<\/body>$/ { print }'

(result)

<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>

And that one works correctly.

What am I doing wrong?

Thanks in advance.

Upvotes: 0

Views: 792

Answers (1)

glenn jackman

Reputation: 247092

You cannot reliably parse HTML with regular expressions. Assuming the HTML is also well-formed XML, you can use xmlstarlet:

xmlstarlet sel -t -c '//div[@class="research"]' -nl example.html  
<div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>
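
As for why the awk range in the question fails on this file: `^` anchors the pattern to the start of the line, but the `<div class="research">` line is indented by four spaces, so the start pattern never matches. The `<body>` line has no indentation, which is why that variant works. Dropping both anchors would not be enough either, because the range would then end at the first `</div>`, the one that closes `class="two"`. If xmlstarlet is not an option, a rough workaround is to count nesting depth in awk. This is only a sketch, not a real HTML parser: it assumes lowercase `div` tags laid out roughly as in the sample, with no divs hidden in comments or attribute values, and `extract_research` is a made-up helper name.

```shell
# Recreate the sample file from the question so the sketch is self-contained.
cat > example.html <<'EOF'
<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>
</html>
EOF

# Hypothetical helper: print the research div up to its matching </div>.
extract_research() {
    awk '
        # Unanchored start pattern, so leading indentation does not matter.
        /<div class="research">/ { inside = 1 }
        inside {
            print
            # gsub() returns the number of matches; replacing with "&"
            # leaves the line unchanged, so this only counts tags.
            depth += gsub(/<div[ >]/, "&")
            depth -= gsub(/<\/div>/, "&")
            # Depth is back to zero once the research div itself is closed.
            if (depth == 0) exit
        }
    ' "$1"
}

extract_research example.html
```

On the sample file this prints the research div with its original indentation and stops before the closing tag of `class="one"`. For anything less regular than this sample, a real XML/HTML parser remains the safer choice.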

Upvotes: 6
