moondog

Reputation: 1577

delete html tag, but not the tag content

I have a bunch of Word docs which were "saved as" filtered html. The html files contain extraneous ole-links which I need to delete. For example, I want to replace:

<h3><a name="OLE_LINK25">My Section Title</a></h3>

with

<h3>My Section Title</h3>

Any suggestions for how I might do this in an automated way?

Upvotes: 1

Views: 201

Answers (2)

chown

Reputation: 52738

You could try something like this (untested, make sure to test first):

sed -E -i".backup" 's/<([^ ]+) name="OLE[^"]*">([^<]+)<\/\1>/\2/g' *.html

What this will do is replace all occurrences of <TAG name="OLE....">WHATEVER_HERE</TAG> with just WHATEVER_HERE in all *.html files. It will also make a backup of each *.html file, copying FILENAME.html to FILENAME.html.backup.
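Before running this against real documents, it may help to try the substitution on a throwaway copy. The following is only a sketch: the temporary directory and the sample.html file name are made up for illustration, and it assumes a sed that accepts -E (extended regular expressions, which the unescaped ( ) groups and + rely on) and an -i option with an attached backup suffix.

```shell
# Work in a scratch directory so no real files are touched (hypothetical sample file).
tmpdir=$(mktemp -d)
printf '%s\n' '<h3><a name="OLE_LINK25">My Section Title</a></h3>' > "$tmpdir/sample.html"

# -E: extended regex syntax; -i".backup": edit in place, keeping FILENAME.html.backup.
sed -E -i".backup" 's/<([^ ]+) name="OLE[^"]*">([^<]+)<\/\1>/\2/g' "$tmpdir"/*.html

cat "$tmpdir/sample.html"          # prints: <h3>My Section Title</h3>
cat "$tmpdir/sample.html.backup"   # prints the original, unmodified line
```

Note that the pattern only matches when the opening tag, its text, and the closing tag sit on one line and the text contains no nested tags, so it is worth comparing a few files against their .backup copies afterwards.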

If necessary, download sed for Windows or GNU sed.

Upvotes: 1

Paul Grime

Reputation: 15104

Jsoup could help remove all anchor tags whose name attribute starts with "OLE".

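// doc is an org.jsoup.nodes.Document, e.g. obtained from Jsoup.parse(...)
Elements anchors = doc.select("a[name^=OLE]");
for (Element anchor : anchors) {
    // Take the anchor's text and set it as the parent's only content,
    // which drops the <a> wrapper (and any sibling nodes in the parent).
    String text = anchor.text();
    Element header = anchor.parent();
    header.text(text);
}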

Upvotes: 1
