Reputation: 17
I am trying to retrieve the text inside a specific div tag in my HTML file.
My HTML is in the following format:
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
When I try to extract the text inside only the div with class2, using this sed command in bash
sed -e ':a;N;$!ba
s/[[:space:]]\+/ /g
s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html
I get an output HTML file containing
some text some text </div> Some more text </div> Too much text
I want everything from the first </div> onward to be removed, but instead only the final </div> is removed.
Can someone please explain my mistake?
Upvotes: 0
Views: 79
Reputation: 74625
You could do this in awk:
awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file
Between the lines that match /class2/ and /<\/div>/, write the contents to an array. At the end of the file, loop through the array, skipping the first and last lines.
Instead of building an array, you could skip the first and last lines by testing each line in the range against a regular expression:
awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
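To try the range-pattern approach on the sample from the question, here is a minimal sketch. The function name `extract_class2` and the temporary file are assumptions made for illustration; they are not part of the commands above. It also assumes the opening tag, the text, and the closing </div> each sit on their own line, and that the class2 div contains no nested div.

```shell
#!/bin/sh
# Hypothetical helper: print the lines strictly between the line matching
# "class2" and the next line matching "</div>".
extract_class2() {
    awk '/class2/,/<\/div>/ { if (!/class2|<\/div>/) print }' "$1"
}

# Recreate the sample HTML from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
EOF

result=$(extract_class2 "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The awk range ends at the first line matching </div> after the start line, which is why the later "Some more text" and "Too much text" lines are not printed.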
Upvotes: 1
Reputation: 853
This retrieves the text inside the div class = "class2" tag:
#!/bin/bash
htmlcode='
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
'
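# $htmlcode is left unquoted: word splitting joins the input onto one line,
# which the grep steps below rely on.
echo $htmlcode |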
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'
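For comparison, a sed-only variant closer to the command in the question could look like the sketch below. The sed attempt in the question keeps too much because `.*` is greedy: `\(.*\)` runs to the end of the input rather than stopping at the first </div>. Replacing it with `[^<]*` stops the capture at the next tag. The function name `extract_with_sed` and the temporary file are assumptions for illustration, and the sketch assumes the class2 div contains no nested tags.

```shell
#!/bin/sh
# Hypothetical helper: slurp the file into one line, collapse whitespace,
# then capture only the text up to the first "<" after the class2 tag.
extract_with_sed() {
    sed -e ':a' -e 'N' -e '$!ba' \
        -e 's/[[:space:]][[:space:]]*/ /g' \
        -e 's/.*<div class = "class2"> *\([^<]*[^< ]\) *<\/div>.*/\1/' "$1"
}

# Recreate the sample HTML from the question in a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
EOF

result=$(extract_with_sed "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

The `[^< ]` at the end of the capture group trims the trailing space left behind by the whitespace-collapsing step.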
Upvotes: 0