virtualeyes

Reputation: 11237

HTML Tidy, cleaning up MS Word markup

I have 10 years of archived article data, most of it riddled with MS Word save-as-HTML markup like <p class="MsoNormal">.

First of all, is HTML Tidy up to the task of stripping out MS Word-generated markup, or do I need to take another approach?

Secondly, the first few years of articles are globbed together by month and stored in the DB in a text-type column. I'd dearly love to break these out into individual articles so the site is easier to search (i.e. a matching search term/phrase should not bring up an entire month of news). The only clear pattern I have to work with to isolate the articles is the article title (in bold, 16-20px) and the article date (generally 10px); both title and date appear before the article body text. Is there a way to detect the <h1>-ness or <small>-ness of markup when I do not have exact markup to match against?

This may be next to impossible to answer, but just in general, what approach would you take to this unenviable task? ;-) I'm on the JVM in Scala, but could do the cleanup job on a LAMP stack as well.

Ideas appreciated!

Upvotes: 2

Views: 601

Answers (1)

Dmitry Ovsyanko

Reputation: 1416

If I were you, I'd use my favorite, the HTML::Parser kit for Perl. It works very well for complex and fuzzily stated problems like yours.
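To give a rough idea of how an event-driven parser such as HTML::Parser handles the "detect the <h1>-ness" part, here is a minimal sketch. It uses Python's standard-library html.parser as a stand-in rather than Perl, so treat it as an illustration of the approach, not the HTML::Parser API. The names RunClassifier and classify, the sample markup, and the thresholds (bold at 16-20px is a title, 10px or smaller is a date) are my assumptions based on the question. It also assumes inline font-size styles in px; Word exports often use pt instead, so the regex would likely need adjusting for real data.

```python
import re
from html.parser import HTMLParser

# Assumption: sizes are inline styles in px, as described in the question.
FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*px", re.I)
BOLD_STYLE = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.I)
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class RunClassifier(HTMLParser):
    """Label each text run 'title', 'date', or 'body' from inherited styling."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # one (tag, size_px, bold) entry per open element
        self.runs = []   # collected (label, text) pairs

    def _current(self):
        # Styling in effect right now: that of the innermost open element.
        return self.stack[-1][1:] if self.stack else (None, False)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never contain text, so never push them
        size, bold = self._current()  # inherit from the parent element
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE.search(style)
        if match:
            size = float(match.group(1))
        if tag in ("b", "strong") or BOLD_STYLE.search(style):
            bold = True
        self.stack.append((tag, size, bold))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        size, bold = self._current()
        if bold and size is not None and 16 <= size <= 20:
            label = "title"   # hypothetical threshold taken from the question
        elif size is not None and size <= 10:
            label = "date"    # hypothetical threshold taken from the question
        else:
            label = "body"
        self.runs.append((label, text))


def classify(html):
    parser = RunClassifier()
    parser.feed(html)
    parser.close()
    return parser.runs


# Made-up sample in the style of Word's save-as-HTML output.
sample = (
    '<p class="MsoNormal"><b><span style="font-size:18px">Budget approved</span></b></p>'
    '<p class="MsoNormal"><span style="font-size:10px">March 3, 2004</span></p>'
    '<p class="MsoNormal">The council voted on Tuesday.</p>'
)

for label, text in classify(sample):
    print(label, "|", text)
```

Once every run carries a label, splitting a month blob into articles is mostly a matter of starting a new record each time a 'title' run appears. Expect to tune the thresholds against real data, since unclosed tags or sizes set through CSS classes rather than inline styles will not be picked up by this sketch.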

Upvotes: 1
