Clean HTML from Word documents

Question

OK, so my company has a client with an interface for posting content: standard MySQL database, PHP-based, etc.

Anyway, they keep having an intern or someone post content to this interface straight from an MS Word doc. The interface is poorly coded and takes this input as is, with no cleanup.

My company has now been contracted to fix this particular problem, as it keeps breaking their site and we have repeatedly had to go into the database manually and delete the offending values.

Is there a quick and easy way to clean up this input, or am I going to have to just do a replace operation on each offending character?

I see htmlentities() may be a partial solution, but as far as I know it only encodes characters rather than removing them, so it won't take care of everything.
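To make the "replace each offending character" idea concrete, here's a rough sketch of what I have in mind. The function name `clean_word_text` and the character map are just my own illustration, not anything from the client's code, and it assumes the pasted text arrives as UTF-8 (if it's actually Windows-1252, it would need converting first). It needs PHP 7 or later for the `\u{...}` escapes.

```php
<?php
// Hypothetical helper: the name and the character list are illustrative assumptions.
function clean_word_text(string $text): string
{
    // Map common Word "smart" punctuation (as UTF-8) to plain ASCII.
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '-',   // em dash
        "\u{2026}" => '...', // ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ];
    $text = strtr($text, $map);

    // Drop any markup that was pasted along with the text.
    $text = strip_tags($text);

    // Escape what is left so it is safe to output inside HTML.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Example call on a string containing smart quotes, an en dash and a tag.
echo clean_word_text("\u{201C}Hello\u{201D} \u{2013} it\u{2019}s <b>here</b>\u{2026}"), "\n";
```

The obvious downside is that the map only covers the characters I've thought of, which is why I'm hoping there's something more complete out there.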

What's a good solution to this problem? Is there anything out there to make this easier?

We're also considering writing a content validator, probably just server-side (though maybe client-side too, if my week is going slowly enough or I finish the rest of this quickly enough).
