Sanitizing input text: characters not being encoded properly

Question

When I copy and paste text from a Word document into Notepad, I get weird characters (presumably due to encoding problems) like this:

... of var¬ious Federal ...

with "¬" being the weirdly encoded symbol. When I read the text file in PHP, I want to remove all of these weirdly encoded symbols. I tried replacing "¬" with an empty string:

return preg_replace('/¬/', '', $string);

but when I return the text to an HTML webpage, that just results in another weird character being put in its place:

... of var�ious Federal ...

Why is this happening, and what can I do to fix it?

Jon · Accepted Answer

Brief intro on character sets and encodings

When documents are displayed on the screen, humans read them as sequences of characters (the visual shapes used to draw those characters are called glyphs). However, when documents are stored on disk they are written as sequences of bytes, just like every other type of file. Therefore a system must be in place that takes care of translating from characters to bytes and vice versa.

Such a system is called a character encoding. Since encodings must be implemented by computers they need to be well-defined, so each encoding can only handle a predefined set of characters, which is unsurprisingly referred to as a character set.

Some encodings always represent each character with a single byte; these are called single-byte encodings. Other encodings use multiple bytes for each character and not necessarily the same number for all possible characters; these are called multibyte encodings.

To recap: a text document logically contains characters which are drawn from some predefined character set, but computers work in terms of bytes so we make up character encodings that convert characters to bytes and vice versa. Some encodings are called multibyte because they use multiple bytes to represent a single character.
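To make the difference concrete, here is a minimal sketch that prints the bytes of the same character, ¬ (U+00AC NOT SIGN), in two encodings. The choice of ISO-8859-1 as the single-byte encoding is an assumption for illustration; the variable names are made up.

```php
<?php
// Bytes for the character "¬" (U+00AC NOT SIGN), written as escapes so the
// result does not depend on how this source file itself is saved.
$notSignLatin1 = "\xAC";      // ISO-8859-1 (single-byte): one byte
$notSignUtf8   = "\xC2\xAC";  // UTF-8 (multibyte): two bytes

// strlen() counts bytes, not characters.
echo strlen($notSignLatin1), ' byte:  ', bin2hex($notSignLatin1), "\n";
echo strlen($notSignUtf8), ' bytes: ', bin2hex($notSignUtf8), "\n";
```

Same character, two different byte sequences: which one ends up on disk depends entirely on the encoding used when saving.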

Back to your problem

When you saved the text file to the disk, Notepad used some encoding to do it (it was a multibyte encoding, but let's pretend we don't know that for now). The character ¬ in the text was given some specific representation in the form of bytes.

When you saved the PHP file to the disk, your source code editor used some encoding to do it. The character ¬ in the string literal '/¬/' was given some specific representation in the form of bytes.

By default preg_replace, just like all the generic-use string functions in PHP, operates in binary mode. This means that it works in terms of bytes. This is in contrast to your source code editor, which is encoding-aware and displays the source in terms of characters. As a result, when you replace what you believe is the character ¬ (NOT SIGN), preg_replace in fact replaces a series of bytes, the exact form of which depends on the encoding of your PHP source.

And therein lies the problem: if the encodings of the text file and your source do not match, all bets are off as to what might actually happen to the text.

Given the results you show, what happened in your case is most probably this:

The text file was saved in some multibyte encoding.
The PHP source was saved in a single-byte encoding.
The single-byte representation of ¬ in the PHP source was part of the multibyte representation of ¬ in the text, so one of those bytes was wiped out.
The remaining byte(s) do not fit the rules of the encoding, so the program that displays the text after the replacement shows a question mark to say "there's something in here, but it's no character that I recognize".
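The sequence of events above can be reproduced with a short sketch. It assumes the text is UTF-8 and that the PHP source was saved as ISO-8859-1, so the pattern '/¬/' ends up containing the single byte 0xAC; both encodings are assumptions, since the question does not state them.

```php
<?php
// The text as read from the file, assumed to be UTF-8: "¬" is C2 AC.
$text = "var\xC2\xACious";

// What '/¬/' becomes if the PHP source is saved as ISO-8859-1 (assumption):
// the pattern contains the single byte AC.
$pattern = "/\xAC/";

// Without the u modifier, preg_replace matches byte by byte, so it removes
// only the AC byte and leaves the lead byte C2 behind.
$result = preg_replace($pattern, '', $text);
echo bin2hex($result), "\n"; // 766172c2696f7573

// An empty pattern with the u modifier matches only if the subject is
// valid UTF-8, which makes it a handy validity check.
$isValidUtf8 = preg_match('//u', $result) === 1;
var_dump($isValidUtf8); // bool(false)
```

The orphaned 0xC2 byte is what the browser renders as the replacement character.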

How to fix it

There are several possibilities, all in line with the above, and they share one requirement: you must know the encoding of your text file (you can easily check this in Notepad: choose "Save As" and look at the bottom of the dialog box). Then you can:

Save your text file and PHP source using the same encoding and everything will work. This is the easiest by far.
Inject into your PHP source the bytes that represent the target character in the encoding of your text file. For example, assume that the text file is saved as UTF-8. This encoding represents the character in question with the byte sequence 0xC2 0xAC, so you can replace this byte sequence by writing the code as
```
preg_replace("/\xc2\xac/", '', $string)
```
and as long as the text file encoding remains UTF-8 this will work no matter what the encoding of your PHP source is.
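A variation on the same idea, not part of the suggestions above: if the text is known to be UTF-8, PCRE's u modifier lets you name the character by code point instead of spelling out its bytes. The sketch below assumes UTF-8 input, and the helper name removeNotSign is hypothetical.

```php
<?php
// Hypothetical helper: strips U+00AC (NOT SIGN) from a UTF-8 string.
function removeNotSign(string $utf8Text): string
{
    // Single quotes: PHP passes \x{00AC} through untouched, and PCRE (with
    // the u modifier) reads it as the code point U+00AC.
    $clean = preg_replace('/\x{00AC}/u', '', $utf8Text);

    // With the u modifier, preg_replace returns null when the subject is
    // not valid UTF-8, so surface that instead of returning garbage.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

echo removeNotSign("var\xC2\xACious"), "\n"; // prints "various"
```

Like the byte-sequence version, the pattern here is pure ASCII, so it does not depend on the encoding of the PHP source file.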
