Bishal Kumar Shrestha

Reputation: 17

Multiple occurrences in sed substitution

I am trying to retrieve the text inside a specific div tag in my HTML file.

My HTML is currently in the following format:

<div class = "class0">
    <div class = "class1">
         <div class = "class2">
             some text some text
         </div>
         Some more text
    </div>
    Too much text
</div>

When I try to extract the contents of just the div with class2 using this sed command

sed -e ':a;N;$!ba
        s/[[:space:]]\+/ /g
        s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html

I get an output file containing

some text some text </div> Some more text </div> Too much text

I want everything from the first </div> onwards to be removed, but instead only the final one is being removed. Can someone please explain my mistake?

Upvotes: 0

Views: 79

Answers (2)

Tom Fenech

Reputation: 74625

You could do this in awk:

awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file

For every line from the one that matches /class2/ to the next one that matches /<\/div>/, store the line in an array. At the end of the file, loop through the array, skipping its first and last elements (the lines holding the opening and closing tags).

Instead of making an array, you could check for the first and last lines using a regular expression:

awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
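
To see the range pattern at work, here is a minimal sketch that runs the second command on the sample from the question. The temporary file and the function name extract_class2 are assumptions made for illustration; they are not part of the original commands.

```shell
#!/bin/sh
# Hypothetical helper: print the lines strictly between the line that
# matches class2 and the next line that contains </div>.
extract_class2() {
    awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' "$1"
}

# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class = "class0">
    <div class = "class1">
         <div class = "class2">
             some text some text
         </div>
         Some more text
    </div>
    Too much text
</div>
EOF

# Capture and show the extracted text, then clean up.
result=$(extract_class2 "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The extracted line keeps its original indentation. This approach assumes the opening and closing tags sit on their own lines, as they do in the sample; text on the same line as either tag would be dropped.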

Upvotes: 1

Keith Reynolds

Reputation: 853

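This works for retrieving the text inside the div class = "class2" tags. It relies on $htmlcode being unquoted in the echo, so that the shell collapses the input onto a single line before sed splits it at each tag: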

#!/bin/bash

htmlcode='
<div class = "class0">
    <div class = "class1">
        <div class = "class2">
            some text some text
        </div>
        Some more text
    </div>
   Too much text
</div>
'

echo $htmlcode |
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'

Upvotes: 0
