Steven Matthews

Reputation: 11275

Clean HTML from Word documents

Ok, so my company has a client that has an interface for posting content - standard MySQL database, PHP-based, etc.

Anyway, they've continually had an intern or someone post content to this interface straight from an MS Word doc. The interface is poorly coded and takes this input as is, without cleaning up the formatting.

My company has now been contracted to fix this particular problem, as it keeps breaking their site, and we have repeatedly had to go into the database manually and delete the offending values.

Is there a quick and easy way to do this, or am I going to have to just do a replace operation on each offending character?

I see htmlentities() may be a partial solution - but as far as I know, that won't remove everything.

What's a good solution to this problem? Is there anything out there to make this easier?

We're also considering writing a content validator, probably just server-side (though maybe client-side too, if my week is going slowly enough or I finish the rest of this quickly enough).

Upvotes: 0

Views: 370

Answers (1)

Jules

Reputation: 91

It depends on how many clients (or potential clients) you are supporting and how much time you have to invest. Your options:

  • Write your own function to strip out the metadata

  • Teach your clients to remove it themselves, for example by pasting into Notepad first,
    or supply a knowledge base article explaining how to do it in the software, perhaps behind a "Help" section or icon they can click on: http://support.microsoft.com/default.aspx?scid=kb;en-us;223396

  • Use a WYSIWYG editor such as TinyMCE, which has built-in functionality to remove it

But like I said in the comments, unless you are using your own function, prepare for clients to continue to paste directly and wonder why there is a problem.
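
For the first option, here is a rough sketch of what such a function might look like in PHP. The function name clean_word_html, the tag allowlist, and the character map are all assumptions chosen for illustration, and it assumes the pasted text has already reached PHP as UTF-8. It uses regular expressions rather than a real HTML parser, so treat it as a starting point and not a complete sanitizer.

```php
<?php
// Hypothetical helper: the name, allowlist, and character map are illustrative
// assumptions, not part of any library. Assumes UTF-8 input and PHP 7 or later.
function clean_word_html(string $html): string
{
    // 1. Drop HTML comments, including Word's conditional comments
    //    such as <!--[if gte mso 9]> ... <![endif]-->.
    $html = preg_replace('/<!--.*?-->/s', '', $html);

    // 2. Drop <style>, <script>, and <xml> blocks together with their contents.
    $html = preg_replace('#<(style|script|xml)\b[^>]*>.*?</\1>#is', '', $html);

    // 3. Keep only a small allowlist of tags. Namespaced Word tags
    //    such as <o:p> are not on the list, so they are removed.
    $html = strip_tags($html, '<p><br><b><strong><i><em><ul><ol><li><a>');

    // 4. Inside each remaining opening tag, remove class, style, and lang
    //    attributes (Word adds these to nearly every element).
    $html = preg_replace_callback('/<[a-z][^>]*>/i', function (array $m): string {
        return preg_replace(
            '/\s+(?:class|style|lang)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)/i',
            '',
            $m[0]
        );
    }, $html);

    // 5. Replace Word's typographic characters with plain ASCII equivalents.
    $map = [
        "\u{2018}" => "'", "\u{2019}" => "'",   // curly single quotes
        "\u{201C}" => '"', "\u{201D}" => '"',   // curly double quotes
        "\u{2013}" => '-', "\u{2014}" => '-',   // en and em dashes
        "\u{2026}" => '...',                    // ellipsis
        "\u{00A0}" => ' ',                      // non-breaking space
    ];

    return trim(strtr($html, $map));
}
```

For example, passing in <p class=MsoNormal style='margin:0'>“Hello”<o:p></o:p></p> should come back as <p>"Hello"</p>. You would still want to escape the result on output (htmlentities() or similar), since this only removes Word's clutter and does not make the markup safe by itself.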

Upvotes: 1
