Darryl Hein

Reputation: 145137

How do you deal with the "special" characters that MS Word adds?

I'm wondering how you clean up the special characters that MS Word adds, such as em and en dashes and curly quotes?

I often find myself copying clients' content from Word and pasting it into a static HTML page, but the content ends up with weird characters because the special characters are not converted to their correct ASCII codes and therefore show up as garbled text. (For these basic websites, I'm using Dreamweaver.)

I have seen a lot of similar problems when clients copy content from Word into text-only fields (mostly textareas). When I put that content into a PDF (through PHP), or when it shows up on the page, it is garbled too.

How do you deal with this? Is there a cleaning service or program you use?

Upvotes: 10

Views: 29032

Answers (6)

JasonPlutext

Reputation: 15878

Make sure Word is configured to use UTF-8 when you "Save As..." HTML.

This is in Options > Word Options > Advanced > Web Options > Encoding

Upvotes: 4

Rutunj sheladiya

Reputation: 646

You can use a preg_replace call to strip every non-ASCII character from your string. This removes Word's special characters, but also any other non-ASCII text such as accented letters:

 $str = preg_replace('/[^\x00-\x7F]+/', '', $str);
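
Stripping every non-ASCII byte deletes the quotes and dashes outright. If you would rather keep readable punctuation, here is a minimal sketch that maps the usual Word characters to ASCII look-alikes instead. It assumes the input is already valid UTF-8; the function name and the choice of replacements are made up for illustration, not taken from the answer above.

```php
<?php
// Hypothetical helper: swap common Word punctuation for ASCII look-alikes.
// Assumes $text is valid UTF-8; everything not in the map is left alone.
function word_punctuation_to_ascii(string $text): string
{
    $map = [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ];
    // strtr() with an array replaces each key with its value in one pass.
    return strtr($text, $map);
}

$cleaned = word_punctuation_to_ascii("\u{201C}It\u{2019}s 9\u{2013}5\u{201D}\u{2026}");
echo $cleaned, "\n"; // prints "It's 9-5"...
```

Unlike the blanket regex, this leaves accented letters and other legitimate non-ASCII text untouched.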

Upvotes: 6

chazomaticus

Reputation: 15786

With regard to clients posting text copied and pasted from Word into textareas:

The most reliable way to ensure that the client sends you text in a particular encoding (thus hopefully doing any conversion from CP-1252 [or whatever Word uses] for you) is to add the accept-charset="..." attribute to all your <form>s. E.g.:

<form ... accept-charset="UTF-8">
   ...
</form>

Most browsers will obey that and make sure any "Word-specific" characters are converted to the appropriate character set before they get to your website.

Once invalid text gets to your website, there's very little you can do to fix it reliably, so it's best to simply check that all input is valid in whatever character set you use, and discard any requests that contain invalid text. This is necessary even with accept-charset, because undoubtedly there are some clients out there that will ignore it.
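
As a rough sketch of that validation step, assuming the site uses UTF-8: the function name below is hypothetical, and a literal array stands in for $_POST so the snippet runs on its own. It relies on the fact that preg_match() with the /u modifier returns false when the subject is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true only if every string in the submitted data is
// well-formed UTF-8. Nested arrays (e.g. name="tags[]") are checked too.
function all_valid_utf8(array $fields): bool
{
    foreach ($fields as $value) {
        if (is_array($value)) {
            if (!all_valid_utf8($value)) {
                return false;
            }
        } elseif (is_string($value) && preg_match('//u', $value) !== 1) {
            // The /u modifier makes PCRE reject malformed UTF-8 subjects.
            return false;
        }
    }
    return true;
}

// Stand-ins for $_POST in a real form handler.
$good = ['body' => "curly \u{201C}quotes\u{201D}"];
$bad  = ['body' => "curly \x93quotes\x94"]; // raw CP-1252 bytes, not UTF-8

var_dump(all_valid_utf8($good));
var_dump(all_valid_utf8($bad));
```

In a real handler you would reject the request (for example with a 400 response) when the check fails, rather than trying to guess what the bytes were meant to be.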

Upvotes: 9

Scott

Reputation: 2793

If it's a Word file that's just text (i.e., no graphics, tables, etc.), you might try saving it as HTML from within Word, copy/pasting the resulting HTML into your document in Dreamweaver, and then using Dreamweaver's "Clean Up Word HTML" function (under the Commands menu).

As an alternative, you can try fix my HTML, though I've not personally tried it with Word text, so results may vary.

Upvotes: 0

Adrien

Reputation: 3205

You might try the Demoroniser.

Upvotes: 2

Michael Borgwardt

Reputation: 346546

Make sure you specify an encoding everywhere, and use UTF-8; then those "special" characters should survive just fine. But once they've gone through an encoding that can't represent them, the information about which character each one originally was is lost, so it can't be repaired (except for some specific, though probably very common, cases like a mix-up between Cp1252 and ISO-8859-1).
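
The Cp1252 case is repairable because Cp1252 puts the curly quotes and dashes at bytes 0x80-0x9F, which ISO-8859-1 reserves for control characters, so the original bytes are still there and only the label is wrong. A minimal sketch, assuming any input that is not already valid UTF-8 really is Cp1252 (that is a guess, not a detection), with a hypothetical function name:

```php
<?php
// Hypothetical helper: return UTF-8 text, treating any input that is not
// already well-formed UTF-8 as Windows-1252. This fallback is only correct
// when the data really did come from a CP-1252 source.
function to_utf8_assuming_cp1252(string $bytes): string
{
    if (preg_match('//u', $bytes) === 1) {
        return $bytes; // already valid UTF-8, leave it alone
    }
    $converted = iconv('Windows-1252', 'UTF-8', $bytes);
    if ($converted === false) {
        // iconv() returns false on bytes that CP-1252 leaves undefined.
        throw new RuntimeException('Input is not valid Windows-1252 either');
    }
    return $converted;
}

// Raw CP-1252 bytes for curly quotes, an en dash and an apostrophe.
$fromWord = "\x93Hello\x94 \x96 it\x92s fine";
echo to_utf8_assuming_cp1252($fromWord), "\n";
```

If the text has already been through a lossy step (for example, characters replaced by "?"), no conversion like this can bring them back.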

Upvotes: 4
