Reputation: 318518
In a Node.js/Express-based application I need to deal with GET requests that may contain umlauts encoded with the ISO-8859-1 charset.
Unfortunately the query string parser seems to handle only plain ASCII and UTF-8:
> qs.parse('foo=bar&xyz=foo%20bar')
{ foo: 'bar', xyz: 'foo bar' } # works fine
> qs.parse('foo=bar&xyz=T%FCt%20T%FCt')
{ foo: 'bar', xyz: 'T%FCt%20T%FCt' } # iso-8859-1 breaks, should be "Tüt Tüt"
> qs.parse('foo=bar&xyz=m%C3%B6p')
{ foo: 'bar', xyz: 'möp' } # utf8 works fine
Is there a hidden option, or another clean way, to make this work with other charsets too? The major problem with the default behaviour is that there is no way for me to tell whether a decoding error occurred: after all, the input could have been something that simply decodes to something that still looks like a URL-encoded string.
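To illustrate the detection problem: unlike qs, the built-in decodeURIComponent does throw on percent-escapes that are not valid UTF-8, which at least makes failure observable (a minimal check, assuming Node's standard global):

```javascript
// decodeURIComponent rejects percent-escapes that are not valid UTF-8,
// so a thrown URIError at least signals that the input was not UTF-8.
function isValidUtf8Query(s) {
  try {
    decodeURIComponent(s);
    return true;
  } catch (e) {
    return false; // URIError: URI malformed
  }
}

console.log(isValidUtf8Query('m%C3%B6p'));      // true  (valid UTF-8)
console.log(isValidUtf8Query('T%FCt%20T%FCt')); // false (Latin-1 bytes)
```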
Upvotes: 3
Views: 4211
Reputation: 140228
Well, URL encoding should always be UTF-8; anything else can be treated as an encoding attack, and the request simply rejected. There is no such thing as a non-UTF-8 character. I don't know why your application would receive query strings in other encodings, but browsers will behave if you just set a charset header on your pages. For API requests or whatever, you can specify UTF-8 and reject invalid UTF-8 as 400 Bad Request.
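A minimal sketch of that strict policy (hypothetical helper, not part of qs or Express): decode each value with decodeURIComponent and let the URIError propagate so the caller can answer with a 400.

```javascript
// Decode a query string strictly as UTF-8; throws URIError on invalid
// UTF-8 so the caller can respond with 400 Bad Request.
function parseUtf8Query(raw) {
  var params = {};
  raw.split('&').forEach(function (pair) {
    if (!pair) return;
    var i = pair.indexOf('=');
    if (i < 0) i = pair.length;
    var key = decodeURIComponent(pair.slice(0, i).replace(/\+/g, ' '));
    var val = decodeURIComponent(pair.slice(i + 1).replace(/\+/g, ' '));
    params[key] = val;
  });
  return params;
}

parseUtf8Query('foo=bar&xyz=m%C3%B6p');  // { foo: 'bar', xyz: 'möp' }
// parseUtf8Query('xyz=T%FCt') throws URIError -> reject the request
```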
If you really mean ISO-8859-1, then it's very simple, because its bytes map one-to-one onto the first 256 Unicode code points:
'T%FCt%20T%FCt'.replace(/%([a-f0-9]{2})/gi, function (match, hex) {
    // Each ISO-8859-1 byte is the same value as its Unicode code point.
    return String.fromCharCode(parseInt(hex, 16));
}); // 'Tüt Tüt'
Although on the web it is probably never genuine ISO-8859-1; in practice it is usually Windows-1252.
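The two encodings differ only in the range 0x80–0x9F, where Windows-1252 has printable punctuation instead of C1 control characters. A sketch with a deliberately partial mapping table (a few common entries, not the full range):

```javascript
// Windows-1252 reuses the ISO-8859-1 layout except for 0x80-0x9F.
// Partial map of that range (not exhaustive):
var cp1252 = {
  0x80: '\u20AC', // euro sign
  0x85: '\u2026', // ellipsis
  0x91: '\u2018', 0x92: '\u2019', // single quotes
  0x93: '\u201C', 0x94: '\u201D', // double quotes
  0x96: '\u2013', 0x97: '\u2014'  // dashes
};

function decodeCp1252(encoded) {
  return encoded.replace(/%([a-f0-9]{2})/gi, function (match, hex) {
    var byte = parseInt(hex, 16);
    // Bytes outside the special range decode exactly like ISO-8859-1.
    return cp1252[byte] || String.fromCharCode(byte);
  });
}

decodeCp1252('T%FCt%20%93quoted%94'); // 'Tüt \u201Cquoted\u201D'
```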
Upvotes: 1
Reputation: 35829
Maybe node-iconv is a solution. Do you know beforehand which encoding is used?
var qs = require('qs');
var Iconv = require('iconv').Iconv;

var parsed = qs.parse('foo=bar&xyz=T%FCt%20T%FCt');
// qs leaves the non-UTF-8 value percent-encoded, so first turn the
// escapes into raw Latin-1 bytes, then convert those bytes to UTF-8.
var bytes = new Buffer(unescape(parsed.xyz), 'binary');
var iconv = new Iconv('ISO-8859-1', 'UTF-8');
var xyz = iconv.convert(bytes).toString(); // 'Tüt Tüt'
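If the encoding is not known up front, one common heuristic is to try strict UTF-8 first and fall back to ISO-8859-1. A sketch using only Node built-ins (not a guarantee: some Latin-1 byte sequences are also valid UTF-8, so the fallback can guess wrong):

```javascript
// Try strict UTF-8 first; on URIError fall back to treating each
// percent-escape as a Latin-1 byte (unescape maps bytes 1:1 to
// the first 256 Unicode code points).
function decodeQueryValue(encoded) {
  try {
    return decodeURIComponent(encoded); // strict UTF-8
  } catch (e) {
    return unescape(encoded);           // ISO-8859-1 fallback
  }
}

decodeQueryValue('m%C3%B6p');      // 'möp' (valid UTF-8)
decodeQueryValue('T%FCt%20T%FCt'); // 'Tüt Tüt' (Latin-1 fallback)
```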
Upvotes: 0