UncleSam

Reputation: 31

How to delete string between two HTML tags in one row using bash script

I have recently been working on a simple bash script that parses specific data from web pages. I used tr '\r\n' ' ' <file1.txt >file2.txt so that all the data extracted from the page (stored in file1.txt) ends up on a single line. Now I need to match every string between <th>...</th> tags on this line and either delete it or replace it with a ' ' (space). Here is some example input:

    <td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>

I have used sed and tried something like

    sed -i 's/\<th\>.*?\<\/th\>/ /g' output.txt

But it didn't work. I think the problem is the ? sign. It works in other regular expression flavors, but apparently not in bash.

Upvotes: 3

Views: 1344

Answers (3)

Gilles Quénot

Reputation: 185025

Your attempt seems definitely wrong.

You can't realistically parse tag-based markup languages like HTML and XML using Bash or utilities such as grep, sed or cut. If you just want to dump/render HTML, see (links|links2|lynx|w3m) -dump, html2text, vilistextum. For parsing out pieces of data, see tidy+(xmlstarlet|xmllint|xmlgawk|xpath|xml2), or learn xslt.


Upvotes: 0

Triangle

Reputation: 1507

 <td>
     Abaktal hm
 </td>
 <th>
     Package
 </th> 
 <td>
     flm 10x400 mg</td>
 <th> 
     Indesit
 </th>

If your input looks like this, the command below will work:

    sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p' output.txt

It will delete the content between the <th>...</th> tags.

For more info, see "removing lines between two patterns (not inclusive) with sed".
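
To see what this approach does on multi-line input, here is a small sketch. It assumes GNU sed (the semicolon-separated labels are a GNU extension), writes both tag patterns out in full with the slash in the closing tag escaped, and reads from a pipe instead of output.txt. The variable names input and result are only for illustration.

```shell
# Multi-line sample, one tag or value per line (illustrative data).
input='<td>
    Abaktal hm
</td>
<th>
    Package
</th>
<td>
    flm 10x400 mg</td>'

# On a line matching <th>: print it, then keep appending lines (N) until the
# pattern space contains </th>; s/.*\n// drops everything up to the last
# newline, so only the closing-tag line is left for the final p.
# Every other line is printed unchanged by the final p.
result=$(printf '%s\n' "$input" |
    sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p')

printf '%s\n' "$result"
```

Note that the <th> and </th> lines themselves are kept; only the lines between them are dropped. This also means it does nothing useful if the opening and closing tags sit on the same line, as in the question's single-row file.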

Upvotes: 0

weldabar

Reputation: 1145

While I agree with sputnick and others, the answer to your immediate question would be:

    sed -i -r 's/<th>[^<]+<\/th>//g' output.txt

This works on your sample data just fine.
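
For anyone who wants to try it without touching a file, here is a minimal sketch that runs the same substitution on the sample row from the question. It assumes a sed that accepts -E for extended regular expressions (GNU sed also accepts -r); the variable names row and cleaned are only for illustration. The reason it works where .*? did not: sed's regular expressions have no lazy quantifiers, but [^<]+ cannot run past the next '<', so each match stops at its own closing tag.

```shell
# Sample row from the question (illustrative variable name).
row='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

# [^<]+ matches up to, but never across, the next '<', so each <th>...</th>
# pair is removed on its own instead of one greedy match spanning both.
cleaned=$(printf '%s\n' "$row" | sed -E 's/<th>[^<]+<\/th>//g')

printf '%s\n' "$cleaned"
```

One thing to watch when combining flags: with GNU sed, -i takes an optional backup suffix attached directly to it, so -ir is read as "edit in place with backup suffix r" and does not turn on extended regular expressions. Keeping the flags separate (-i -r or -i -E) avoids that.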

Upvotes: 4
