Sijia Din

Reputation: 1363

How to properly decode url with unicode in C

From my referrer log I'm trying to decode the referrers, but it looks like %81 and %8A are not valid percent-encodings, so I get cari�0�9o.

I need to send the decoded string through a websocket, right now I get Could not decode a text frame as UTF-8. on the browser side.

Are these even valid percent-encodings? How can I tell whether they are valid or not?

#include <stdlib.h>
#include <ctype.h>
#include <stdio.h>

/* Decode %XX escapes and '+' from src into dst.
   dst must have room for the result (at most strlen(src) + 1 bytes). */
void urldecode2(char *dst, const char *src) {
    char a, b;
    while (*src) {
        if (*src == '%' && (a = src[1]) && (b = src[2])
            && isxdigit((unsigned char)a) && isxdigit((unsigned char)b)) {
            /* Convert the two hex digits to their numeric values. */
            if (a >= 'a')
                a -= 'a' - 'A';
            if (a >= 'A')
                a -= 'A' - 10;
            else
                a -= '0';
            if (b >= 'a')
                b -= 'a' - 'A';
            if (b >= 'A')
                b -= 'A' - 10;
            else
                b -= '0';
            *dst++ = (char)(16 * a + b);
            src += 3;
        } else if (*src == '+') {
            *dst++ = ' ';   /* '+' means space in query strings */
            src++;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}

int main () {
    const char *in = "http://www.google.co.in/search?q=cari%810%8A9o";
    char out[100];

    urldecode2(out, in);
    printf("%s\n", out);

    return 0;
}

Upvotes: 0

Views: 1263

Answers (1)

rici

Reputation: 241711

%81 and %8A are perfectly valid %-escapes, but the result is not a UTF-8 string. URLs are not required to be UTF-8 strings, but these days they usually are.

It looks to me like some very strange double encoding has happened. There is no convention I know of which uses three-digit %-encodings, but that's what it looks like you have in that URL. On the assumption that the intention was to encode the Spanish word "cariño" (care, affection, fondness), it should have been cari%C3%B1o in UTF-8, or cari%F1o in ISO-8859-1/Windows-1252 (which usually show up in URLs by accident).

The rules for valid UTF-8 sequences are simple enough that you can check for a valid sequence using a regular expression. Not all valid sequences are mapped to characters, and 66 of them are mapped explicitly as "not characters", but all valid sequences should be accepted by a conforming decoder even if it later rejects the decoded character as semantically incorrect.

A UTF-8 sequence is a one-to-four byte sequence corresponding to one of the following patterns: (taken from the Unicode standard, table 3.7)

    Byte 1      Byte 2      Byte 3      Byte 4
    ------      ------      ------      ------
    00..7F        --          --          --
    C2..DF      80..BF        --          --
    E0          A0..BF      80..BF        --
    E1..EC      80..BF      80..BF        --
    ED          80..9F      80..BF        --
    EE..EF      80..BF      80..BF        --
    F0          90..BF      80..BF      80..BF
    F1..F3      80..BF      80..BF      80..BF
    F4          80..8F      80..BF      80..BF

Anything else is illegal. (So codes C0, C1 and F5 through FF cannot appear at all.) In particular, the hex codes 81 and 8A can never start a UTF-8 sequence.

Since there is no good way to know what might be meant by an invalid sequence, the simplest thing is just to strip them out.

Upvotes: 3
