How to calculate URL encoding for characters outside the ASCII character set?

Question

I know that for ASCII characters the URL encoding is just a percent sign followed by the hex number that corresponds to the character. But for characters outside that range, the URL encoding consists of two or more %hex-number sequences.

For example, for the character that corresponds to hex value 56CE, the URL encoding according to the standard .NET/Java APIs is not %56CE but "%e5%9b%8e".

So if we know the hex value for a character outside the ASCII range, how is the URL encoding calculated? In other words, how do e5, 9b, 8e come out of 56CE? I tried converting to binary and did see a pattern for the last two bytes (%9b, %8e), but I have no idea where the %e5 comes from.

Remy Lebeau · Accepted Answer

You have to encode the Unicode code points into bytes using a charset first, and then you can URL-encode those bytes. In your example, E5 9B 8E are the UTF-8 encoded bytes of Unicode code point U+56CE, and %E5%9B%8E is the URL-encoded form of those UTF-8 bytes.
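
To see where each byte comes from, here is a minimal Python sketch of the two steps. The helper names `utf8_bytes` and `percent_encode` are made up for illustration, and the sketch assumes a valid code point (it does not reject surrogates or values above U+10FFFF). U+56CE falls in the three-byte UTF-8 range, so its 16 bits are split 4 + 6 + 6: the lead byte is `1110xxxx` (which is where the E5 comes from) and the two continuation bytes are `10xxxxxx`.

```python
from urllib.parse import quote


def utf8_bytes(codepoint: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (no validation)."""
    if codepoint < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def percent_encode(codepoint: int) -> str:
    """Percent-encode every UTF-8 byte of the code point.

    Note: this escapes unreserved ASCII characters too, which a real
    URL encoder would leave as-is.
    """
    return "".join(f"%{b:02X}" for b in utf8_bytes(codepoint))


# 0x56CE = 0101 011011 001110
#   0101   -> 1110 0101 = E5
#   011011 -> 10 011011 = 9B
#   001110 -> 10 001110 = 8E
print(percent_encode(0x56CE))   # %E5%9B%8E
print(quote(chr(0x56CE)))       # same result from the standard library
```

In practice you would just call the library routine (`urllib.parse.quote` here, which defaults to UTF-8); the manual version is only meant to show the bit layout.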
