Reputation: 21

Two characters seem identical but UTF-8 encodings are not identical

I need to filter out some disallowed strings such as "Password", but someone has bypassed my check. They entered a string that looks exactly like "Password" but does not compare equal to it. When I checked its encoding, the "a", for example, came out as 8e61, while a normal "a" is 61 (hex). My PHP files, the HTML meta Content-Type, and the MySQL encoding are all UTF-8.

How does this happen? Why are there visually identical characters with different codes? How can I filter these characters? I have put the odd string here; please copy it for research: Password

For some reason, when I pasted the problematic "Password" here, it was displayed as the plain ASCII one.

I ran the PHP function bin2hex() on the problematic "Password" and got the following:

50c28e61c28e73c28e73c28e776fc28e72c28e64c28e

while a normal one is:

50617373776f7264.

To make it simpler, the hexadecimal representation of the problematic "a" is:

c28e61

while the normal one is:

61

Upvotes: 2

Answers (2)

Jonathan Leffler

Reputation: 754010

The hex string 50c28e61c28e73c28e73c28e776fc28e72c28e64c28e is a valid UTF-8 encoding, and it decodes as follows:

0x50      = U+0050 = P
0xC2 0x8E = U+008E = SS2
0x61      = U+0061 = a
0xC2 0x8E = U+008E = SS2
0x73      = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x73      = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x77      = U+0077 = w
0x6F      = U+006F = o
0xC2 0x8E = U+008E = SS2
0x72      = U+0072 = r
0xC2 0x8E = U+008E = SS2
0x64      = U+0064 = d
0xC2 0x8E = U+008E = SS2

The 0xC2 0x8E sequence is the UTF-8 encoding of U+008E (byte 0x8E in ISO 8859-1), which is the control character SS2, or Single Shift 2 (see the Unicode Code Charts). SS2 has no defined visible representation, which is why the string looks like plain 'Password' on screen even though it is clearly different. As long as you don't strip out control characters, you should be able to spot the difference, because a string comparison should not treat it as identical to plain 'Password'.
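A minimal sketch of that inspection, written in Python purely for illustration (the question's own stack is PHP). The helper names describe and strip_controls are hypothetical. It decodes the hex dump from the question, reports each code point with its Unicode general category, and then shows one possible way to catch the bypass: remove control characters (category Cc) before comparing against the banned word.

```python
import unicodedata

# Hex dump quoted in the question.
HEX_DUMP = "50c28e61c28e73c28e73c28e776fc28e72c28e64c28e"


def describe(text):
    """Return (code point, general category) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


def strip_controls(text):
    """Drop every character in general category Cc.

    Cc covers both the C0 and C1 control ranges, so this also removes
    tabs and newlines; that is an assumption that suits a one-line field.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


suspect = bytes.fromhex(HEX_DUMP).decode("utf-8")

print(describe(suspect)[:3])                  # first three code points
print(suspect == "Password")                  # raw comparison
print(strip_controls(suspect) == "Password")  # comparison after stripping
```

The raw comparison reports the strings as different, which is the behaviour described above; the stripped comparison is only a suggestion for a filter that should still recognise the banned word.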

Upvotes: 1

Joey

Reputation: 354576

What you might be seeing (I can't tell exactly because parts of your question don't make sense or are inconsistent) are so-called homoglyphs. Those are characters that look identical or very similar and thus can be mistaken for one another at first glance. To circumvent your check, people can use a Cyrillic а (U+0430) in place of the Latin a and get away with it. But frankly, this isn't actually a problem, because I know of no password cracker that will actually try mixing scripts, as most passwords are ASCII-only.

As for the why, you can take a look at "Why are there duplicate characters in Unicode?".
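For the homoglyph case, a rough sketch in Python (illustrative only; the function name non_latin_letters is hypothetical) flags letters whose Unicode character name does not start with "LATIN". The name prefix is only a stand-in for a real script property, which the Python standard library does not expose, so treat this as a heuristic rather than a complete homoglyph check.

```python
import unicodedata


def non_latin_letters(text):
    """Return (char, code point, name) for letters not named 'LATIN ...'."""
    flagged = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if ch.isalpha() and not name.startswith("LATIN"):
            flagged.append((ch, f"U+{ord(ch):04X}", name or "UNKNOWN"))
    return flagged


# The second letter is CYRILLIC SMALL LETTER A, not the Latin "a".
lookalike = "P\u0430ssword"

print(lookalike == "Password")
print(non_latin_letters(lookalike))
print(non_latin_letters("Password"))
```

A filter could reject, or at least log, any input for which this list is non-empty; whether that is acceptable depends on whether non-Latin input is legitimate for the field.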

Upvotes: 0
