Reputation: 318518
In a Node.js/Express-based application I need to deal with GET requests that may contain umlauts encoded with the ISO-8859-1 charset.
Unfortunately the query string parser seems to handle only plain ASCII and UTF-8:
> qs.parse('foo=bar&xyz=foo%20bar')
{ foo: 'bar', xyz: 'foo bar' } # works fine
> qs.parse('foo=bar&xyz=T%FCt%20T%FCt')
{ foo: 'bar', xyz: 'T%FCt%20T%FCt' } # iso-8859-1 breaks, should be "Tüt Tüt"
> qs.parse('foo=bar&xyz=m%C3%B6p')
{ foo: 'bar', xyz: 'möp' } # utf8 works fine
Is there a hidden option, or another clean way, to make this work with other charsets too? The major problem with the default behaviour is that there is no way for me to tell whether a decoding error occurred: after all, the input could have been something that simply decodes to something that still looks like a URL-encoded string.
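To illustrate the detection problem: unlike qs, the built-in decodeURIComponent does throw on percent-escapes that are not valid UTF-8, which at least makes failure observable (a minimal check, assuming Node's standard global):

```javascript
// decodeURIComponent rejects percent-escapes that are not valid UTF-8,
// so a thrown URIError at least signals that the input was not UTF-8.
function isValidUtf8Query(s) {
  try {
    decodeURIComponent(s);
    return true;
  } catch (e) {
    return false; // URIError: URI malformed
  }
}

console.log(isValidUtf8Query('m%C3%B6p'));      // true  (valid UTF-8)
console.log(isValidUtf8Query('T%FCt%20T%FCt')); // false (Latin-1 bytes)
```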
Upvotes: 3
Views: 4211
Reputation: 140228
Well, URL encoding should always be UTF-8; anything else can be treated as an encoding attack, and the request simply rejected. There is no such thing as a non-UTF-8 character. I don't know why your application would receive query strings in other encodings, but browsers will behave if you just set a charset header on your pages. For API requests or whatever, you can specify UTF-8 and reject invalid UTF-8 as 400 Bad Request.
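A minimal sketch of that strict policy (hypothetical helper, not part of qs or Express): decode each value with decodeURIComponent and let the URIError propagate so the caller can answer with a 400.

```javascript
// Decode a query string strictly as UTF-8; throws URIError on invalid
// UTF-8 so the caller can respond with 400 Bad Request.
function parseUtf8Query(raw) {
  var params = {};
  raw.split('&').forEach(function (pair) {
    if (!pair) return;
    var i = pair.indexOf('=');
    if (i < 0) i = pair.length;
    var key = decodeURIComponent(pair.slice(0, i).replace(/\+/g, ' '));
    var val = decodeURIComponent(pair.slice(i + 1).replace(/\+/g, ' '));
    params[key] = val;
  });
  return params;
}

parseUtf8Query('foo=bar&xyz=m%C3%B6p');  // { foo: 'bar', xyz: 'möp' }
// parseUtf8Query('xyz=T%FCt') throws URIError -> reject the request
```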
If you really mean ISO-8859-1, then it's very simple, because its bytes map one-to-one onto the first 256 Unicode code points:
'T%FCt%20T%FCt'.replace(/%([a-f0-9]{2})/gi, function (match, hex) {
    // Each ISO-8859-1 byte is the same value as its Unicode code point.
    return String.fromCharCode(parseInt(hex, 16));
}); // 'Tüt Tüt'
Although on the web it is probably never genuine ISO-8859-1; in practice it is usually Windows-1252.
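The two encodings differ only in the range 0x80–0x9F, where Windows-1252 has printable punctuation instead of C1 control characters. A sketch with a deliberately partial mapping table (a few common entries, not the full range):

```javascript
// Windows-1252 reuses the ISO-8859-1 layout except for 0x80-0x9F.
// Partial map of that range (not exhaustive):
var cp1252 = {
  0x80: '\u20AC', // euro sign
  0x85: '\u2026', // ellipsis
  0x91: '\u2018', 0x92: '\u2019', // single quotes
  0x93: '\u201C', 0x94: '\u201D', // double quotes
  0x96: '\u2013', 0x97: '\u2014'  // dashes
};

function decodeCp1252(encoded) {
  return encoded.replace(/%([a-f0-9]{2})/gi, function (match, hex) {
    var byte = parseInt(hex, 16);
    // Bytes outside the special range decode exactly like ISO-8859-1.
    return cp1252[byte] || String.fromCharCode(byte);
  });
}

decodeCp1252('T%FCt%20%93quoted%94'); // 'Tüt \u201Cquoted\u201D'
```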
Upvotes: 1
Reputation: 35829
Maybe node-iconv is a solution. Do you know beforehand which encoding is used?
var qs = require('qs');
var Iconv = require('iconv').Iconv;

var parsed = qs.parse('foo=bar&xyz=T%FCt%20T%FCt');
// qs leaves the non-UTF-8 value percent-encoded, so first turn the
// escapes into raw Latin-1 bytes, then convert those bytes to UTF-8.
var bytes = new Buffer(unescape(parsed.xyz), 'binary');
var iconv = new Iconv('ISO-8859-1', 'UTF-8');
var xyz = iconv.convert(bytes).toString(); // 'Tüt Tüt'
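If the encoding is not known up front, one common heuristic is to try strict UTF-8 first and fall back to ISO-8859-1. A sketch using only Node built-ins (not a guarantee: some Latin-1 byte sequences are also valid UTF-8, so the fallback can guess wrong):

```javascript
// Try strict UTF-8 first; on URIError fall back to treating each
// percent-escape as a Latin-1 byte (unescape maps bytes 1:1 to
// the first 256 Unicode code points).
function decodeQueryValue(encoded) {
  try {
    return decodeURIComponent(encoded); // strict UTF-8
  } catch (e) {
    return unescape(encoded);           // ISO-8859-1 fallback
  }
}

decodeQueryValue('m%C3%B6p');      // 'möp' (valid UTF-8)
decodeQueryValue('T%FCt%20T%FCt'); // 'Tüt Tüt' (Latin-1 fallback)
```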
Upvotes: 0