Reputation: 31
I have recently been working on a simple bash script that parses specific data from web pages. I used tr '\r\n' ' ' <file1.txt >file2.txt to make sure that all the data extracted from the page, which is stored in file1.txt, ends up on a single line. Now I need to match every string between <th>...</th> tags in that line and either delete it or replace it with a ' ' character.
Here is some example input:
<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>
I used sed and tried something like:
sed -i 's/\<th\>.*?\<\/th\>/ /g' output.txt
But it didn't work. I think the problem is the ? sign. It works in regular expressions elsewhere, but probably not in bash.
Upvotes: 3
Views: 1344
Reputation: 185025
Your attempt is the wrong approach.
You can't realistically parse tag-based markup languages like HTML and XML using Bash or utilities such as grep, sed or cut. If you just want to dump/render HTML, see (links|links2|lynx|w3m) -dump, html2text, vilistextum. For parsing out pieces of data, see tidy+(xmlstarlet|xmllint|xmlgawk|xpath|xml2), or learn xslt.
See
Upvotes: 0
Reputation: 1507
<td>
Abaktal hm
</td>
<th>
Package
</th>
<td>
flm 10x400 mg</td>
<th>
Indesit
</th>
If your input looks like this, the command below will work:
sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p' output.txt
It deletes the content between the <th>...</th> tags and leaves the tag lines themselves in place.
For more info, see "removing lines between two patterns (not inclusive) with sed".
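To see what the loop does, here is a small sketch that runs the expression with both tag patterns spelled out (`/<th>/` to start, `/<\/th>/` to stop) on input shaped like the sample above. It assumes GNU sed, and the variable names `input` and `result` are only for illustration. On a line matching `<th>`, sed prints it, keeps appending lines with `N` until `</th>` shows up, then strips everything up to the last newline so only the closing tag line is left to print.

```shell
#!/usr/bin/env bash
# Multi-line input in the same shape as the sample above.
input='<td>
Abaktal hm
</td>
<th>
Package
</th>
<td>
flm 10x400 mg</td>
<th>
Indesit
</th>'

# /<th>/      : start of a block; print the opening tag line
# :a; N       : append the next line to the pattern space
# /<\/th>/!ba : loop until the closing tag has been read
# s/.*\n//    : drop everything except the closing tag line
result=$(printf '%s\n' "$input" |
  sed -n '/<th>/{p; :a; N; /<\/th>/!ba; s/.*\n//}; p')

printf '%s\n' "$result"
```

The `<td>` lines pass through untouched, while the text between each `<th>` and `</th>` line is gone. This only helps when the tags sit on their own lines; for input flattened onto one row it does not apply.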
Upvotes: 0
Reputation: 1145
While I agree with sputnick and others, the answer to your immediate question would be:
sed -r -i 's/<th>[^<]+<\/th>//g' output.txt
This works on your sample data just fine.
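A quick way to check this without touching any file is to pipe the sample line from the question through the same expression. This is only a sketch and assumes GNU sed (for `-r`); the variable names `line` and `result` are mine. sed's regex dialects have no lazy `.*?`, so the negated class `[^<]+` is what keeps each match from running past the nearest closing tag. One caveat for GNU sed when editing in place: `-ir` is parsed as `-i` with a backup suffix of `r`, not as `-i -r`, so the two options should be given separately.

```shell
#!/usr/bin/env bash
# Sample line from the question (one row of flattened HTML).
line='<td>Abaktal hm</td> </tr> <tr> <th>Package</th> <td>flm 10x400 mg</td> <th>Indesit</th>'

# -r enables extended regex, so + means "one or more".
# [^<]+ cannot cross a '<', so each match ends at the nearest </th>.
result=$(printf '%s\n' "$line" | sed -r 's/<th>[^<]+<\/th>//g')

printf '%s\n' "$result"
```

Both `<th>...</th>` elements are removed and the `<td>` cells are left alone. To replace them with a space instead of deleting them, put a single space in the replacement part: `s/<th>[^<]+<\/th>/ /g`.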
Upvotes: 4