HTML Tidy, cleaning up MS Word markup

Question

I have 10 years of archived article data, most of it riddled with MS Word save-as-HTML markup like

First of all, is HTML Tidy up to the task of stripping out MS Word-generated markup, or do I need to take another approach?
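For illustration, here is a minimal sketch of what "another approach" might look like: a regex pass over the most common Word debris. It assumes typical Word output (conditional comments, namespaced tags such as `<o:p>`, `Mso*` classes, and `mso-*` style properties). The function name `strip_word_markup` is hypothetical, and a pass like this is a heuristic rather than a replacement for a real HTML cleaner.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove common MS Word save-as-HTML debris (heuristic, not a full cleaner)."""
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(r"<!--\[if.*?<!\[endif\]-->", "", html, flags=re.S | re.I)
    # Namespaced Office tags such as <o:p>, </o:p>, <w:WordDocument>
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.I)
    # class attributes starting with "Mso", quoted or unquoted
    html = re.sub(r"""\s+class=(?:"Mso[^"]*"|'Mso[^']*'|Mso[\w-]*)""", "", html, flags=re.I)
    # mso-* declarations (note: this would also hit body text that looks like one)
    html = re.sub(r"""mso-[\w-]+\s*:[^;"']*;?\s*""", "", html, flags=re.I)
    # style attributes left empty by the previous step
    html = re.sub(r"""\s+style=(["'])\s*\1""", "", html, flags=re.I)
    return html


if __name__ == "__main__":
    sample = "<p class=MsoNormal style='mso-margin-top-alt:auto'>Hello<o:p></o:p></p>"
    print(strip_word_markup(sample))
```

Ordinary declarations such as `font-size` are left in place on purpose, since they are the only signal available for telling titles and dates apart later.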

Secondly, the first few years of articles are globbed together by month and stored in the DB as a text storage type. I'd dearly love to break these out into individual articles so the site is easier to search (i.e. not bring up an entire month of news when a search term/phrase matches). The only clear pattern I have to work with to isolate the articles is the article title (in bold, between 16 and 20px) and the article date (generally 10px); both title and date appear before the article body text. Is there a way to detect the

-ness or -ness of markup when I do not have exact markup to match against?
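A rough sketch of the heuristic described above, using only the Python standard library: walk the markup, track the inherited font size and boldness of each open element, and start a new article whenever bold text of 16 to 20px appears. The class and function names (`ArticleSplitter`, `split_articles`) are hypothetical, and the sketch assumes sizes are given as inline `font-size: Npx` styles; markup that uses `pt` sizes, `<font size>` or CSS classes would need extra handling.

```python
import re
from html.parser import HTMLParser

SIZE_RE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*px", re.I)
BOLD_RE = re.compile(r"font-weight\s*:\s*bold", re.I)
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class ArticleSplitter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []     # (tag, font size in px or None, is_bold) per open element
        self.articles = []  # dicts with "title", "date", "body"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Size and boldness are inherited from the parent unless overridden
        parent_size, parent_bold = self.stack[-1][1:] if self.stack else (None, False)
        match = SIZE_RE.search(style)
        size = float(match.group(1)) if match else parent_size
        bold = parent_bold or tag in ("b", "strong") or bool(BOLD_RE.search(style))
        self.stack.append((tag, size, bold))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unmatched closers are ignored
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        size, bold = self.stack[-1][1:] if self.stack else (None, False)
        last = self.articles[-1] if self.articles else None
        if bold and size is not None and 16 <= size <= 20:
            if last and last["date"] is None and not last["body"]:
                last["title"] += " " + text  # title split across several text nodes
            else:
                self.articles.append({"title": text, "date": None, "body": []})
        elif last is not None:
            if size == 10 and last["date"] is None and not last["body"]:
                last["date"] = text
            else:
                last["body"].append(text)


def split_articles(html):
    parser = ArticleSplitter()
    parser.feed(html)
    parser.close()
    return [{**a, "body": " ".join(a["body"])} for a in parser.articles]
```

Text that appears before the first detected title is dropped, and only the first 10px run after a title is treated as the date. Both choices are guesses that would need checking against the real data.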

This may be next to impossible to answer, but just in general, what approach would you take to this unenviable task? ;-) I'm on the JVM in Scala, but could do the cleanup job on a LAMP stack as well.

Ideas appreciated!

