Reputation: 21
I need to filter out some forbidden strings such as "Password", but I found that someone bypassed my check. They entered a string that looks exactly like "Password" but does not compare equal to it.
I checked its Unicode encoding and, for example, the "a" is 8e61, while a normal "a" is 61 (hex).
My PHP files' encoding, HTML meta Content-Type and MySQL encoding are utf-8.
How does this happen? Why are there visually identical characters with different codes? I want to know how I can filter these characters. I put the weird string here; please copy it for research: Password
For some reason, when I copied the problematic "Password" here, it was displayed as the plain ASCII one.
I used the PHP function bin2hex() on the problematic "Password" and got the following:
50c28e61c28e73c28e73c28e776fc28e72c28e64c28e
while a normal one is:
50617373776f7264.
To make it simpler, the hexadecimal representation for "a" is:
c28e61
while normal one is:
61
Upvotes: 2
Views: 982
Reputation: 754010
Given the hex string 50c28e61c28e73c28e73c28e776fc28e72c28e64c28e, you have an encoding of a valid UTF-8 string:
0x50 = U+0050 = P
0xC2 0x8E = U+008E = SS2
0x61 = U+0061 = a
0xC2 0x8E = U+008E = SS2
0x73 = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x73 = U+0073 = s
0xC2 0x8E = U+008E = SS2
0x77 = U+0077 = w
0x6F = U+006F = o
0xC2 0x8E = U+008E = SS2
0x72 = U+0072 = r
0xC2 0x8E = U+008E = SS2
0x64 = U+0064 = d
0xC2 0x8E = U+008E = SS2
The 0xC2 0x8E sequence is the UTF-8 encoding of U+008E (the same value as byte 0x8E in ISO 8859-1), which is the control character SS2, or Single Shift 2 (see the Unicode Code Charts). SS2 has no defined visible representation, which is why the string can look like plain 'Password' on screen. It is nevertheless clearly a different string. As long as you don't strip out control characters before comparing, a string comparison will not treat it as identical to plain 'Password', so you should be able to spot the difference.
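If the goal is to make the filter catch this input, one option is to remove control characters before comparing. The following is only a minimal sketch under assumptions: `strip_control_chars` is a hypothetical helper name, not something from the question, and it assumes the input is valid UTF-8 and that dropping every control character (including tabs and newlines) is acceptable for the field being checked.

```php
<?php
// Hex dump taken from the question: "Password" with U+008E (SS2) mixed in.
$hex = '50c28e61c28e73c28e73c28e776fc28e72c28e64c28e';
$suspect = hex2bin($hex);

// Hypothetical helper: remove every Unicode control character (category Cc),
// i.e. U+0000..U+001F, U+007F and the C1 range U+0080..U+009F.
function strip_control_chars(string $input): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $clean = preg_replace('/\p{Cc}/u', '', $input);

    // preg_replace() returns null if the subject is not valid UTF-8;
    // this sketch simply treats that case as an empty string.
    return $clean ?? '';
}

$cleaned = strip_control_chars($suspect);

var_dump($suspect === 'Password'); // bool(false)
var_dump($cleaned === 'Password'); // bool(true)
echo bin2hex($cleaned), "\n";      // 50617373776f7264
```

Whether to silently strip such characters or to reject the input outright is a design choice; rejecting is stricter and avoids storing a value that differs from what the user typed.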
Upvotes: 1
Reputation: 354576
What you might be seeing (I can't tell exactly because parts of your question don't make sense or are inconsistent) are so-called homoglyphs. Those are characters that look identical or very similar and thus can be mistaken for one another at first glance. To circumvent your check, people can use a Cyrillic a instead of the Latin one and get away with it. But frankly, this isn't actually a problem, because I know of no password cracker that will actually try mixing scripts, as most passwords are ASCII-only.
As for the why, you can take a look at "Why are there duplicate characters in Unicode?"
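To illustrate the homoglyph case, here is a small sketch of how a lookalike string differs from the ASCII one and how it could be flagged. It rests on assumptions: `has_non_printable_ascii` is a hypothetical helper name, and restricting the field to printable ASCII is only appropriate if the application does not need to accept other scripts.

```php
<?php
// Hypothetical helper: true when the string contains any byte outside
// printable ASCII (space through tilde). It works on bytes, so every
// multi-byte UTF-8 character and every control character is flagged.
function has_non_printable_ascii(string $input): bool
{
    return preg_match('/[^\x20-\x7E]/', $input) === 1;
}

// "Password" with a Cyrillic small a (U+0430) in place of the Latin a.
$lookalike = "P\u{0430}ssword";

var_dump($lookalike === 'Password');            // bool(false)
var_dump(has_non_printable_ascii($lookalike));  // bool(true)
var_dump(has_non_printable_ascii('Password'));  // bool(false)

// With the /u modifier, PCRE can also test for a specific script.
var_dump(preg_match('/\p{Cyrillic}/u', $lookalike) === 1); // bool(true)
```

An ASCII-only rule is blunt but easy to reason about; if other scripts must be allowed, a per-script check along the lines of the last call is one possible starting point.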
Upvotes: 0