Reputation: 1577
I have a bunch of Word docs which were "saved as" filtered html. The html files contain extraneous ole-links which I need to delete. For example, I want to replace:
<h3><a name="OLE_LINK25">My Section Title</a></h3>
with
<h3>My Section Title</h3>
Any suggestions for how I might do this, in an automated way?
Upvotes: 1
Views: 201
Reputation: 52738
You could try something like this (untested, make sure to test first):
sed -E -i.backup 's/<([^ ]+) name="OLE[^"]*">([^<]+)<\/\1>/\2/g' *.html
What this will do is replace all occurrences of <TAG name="OLE....">WHATEVER_HERE</TAG>
with just WHATEVER_HERE
in all *.html files. The -E flag turns on extended regular expressions, which the unescaped ( ), + and \1 in the pattern rely on. It will also make a backup of each *.html file, copying FILENAME.html to FILENAME.html.backup
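Since the command is untested, it may be worth a dry run on a single line before touching any files. The sketch below pipes the sample line from the question through the same expression without -i, so nothing is modified on disk. It assumes a sed that accepts -E (GNU and BSD sed both do); strip_ole is a hypothetical helper name used only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: applies the substitution to stdin and writes to stdout.
# No -i here, so no file is edited.
strip_ole() {
  sed -E 's/<([^ ]+) name="OLE[^"]*">([^<]+)<\/\1>/\2/g'
}

# Feed in the sample line from the question and inspect the result.
printf '%s\n' '<h3><a name="OLE_LINK25">My Section Title</a></h3>' | strip_ole
```

If the printed line is the bare <h3> heading, the expression is behaving as intended for that case, and the in-place version can be tried on a copy of one file next.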
If necessary, download sed for Windows, or GNU sed.
Upvotes: 1
Reputation: 15104
Jsoup could help to remove all anchor tags with name starting with "OLE".
// Assumes doc is a parsed org.jsoup.nodes.Document, e.g. Jsoup.parse(file, "UTF-8")
Elements anchors = doc.select("a[name^=OLE]");
for (Element anchor : anchors) {
    // unwrap() drops the <a> tag but keeps its children in place,
    // so any other content in the parent heading is left untouched
    anchor.unwrap();
}
Upvotes: 1