Reputation: 6357
I'm using the Request module (Simplified HTTP request method) to scrape a webpage with accented characters such as á é ó ú ê ã.
I've already tried encoding: utf-8 with no success. I'm still getting these ��� characters in the result.
request.get({
uri: url,
encoding: 'utf-8'
// ...
Is there any configuration to fix it?
I don't know if it is a bug, but I filed an issue for this module. No answers yet. :/
Upvotes: 20
Views: 23191
Reputation: 342
Not a direct answer to the OP, but I had a similar problem and this might help someone.
In my case the response was gzip-compressed, so it had to be decompressed before being decoded as text:
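var request = require('request'), zlib = require('zlib');
var headers = { 'Accept-Encoding': 'gzip' };
// encoding: null makes request hand back the body as a raw Buffer
request({ url: url, headers: headers, encoding: null }, function (err, res, body) {
  zlib.gunzip(body, function (err, decoded) {
    console.log(decoded.toString());
  });
});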
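The order of operations matters here: decompress the bytes first, then decode them as text. A minimal offline sketch of that order, using only Node's built-in zlib; the sample string and variable names are made up for illustration and stand in for a real gzip-compressed response body:

```javascript
var zlib = require('zlib');

// Hypothetical sample text standing in for a page with accented characters.
var original = 'ação é útil';

// Simulate what a server sends when it honours Accept-Encoding: gzip.
var compressed = zlib.gzipSync(Buffer.from(original, 'utf8'));

// Decoding the compressed bytes directly as UTF-8 does not give the text back.
var garbled = compressed.toString('utf8');

// Decompress first, then decode the resulting bytes as text.
var decoded = zlib.gunzipSync(compressed).toString('utf8');

console.log(garbled === original); // false
console.log(decoded === original); // true
```

As far as I know, request also has a gzip: true option that adds the Accept-Encoding header and decompresses the response for you, which may be simpler than calling zlib by hand.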
Upvotes: 0
Reputation: 4652
I tried the following and it worked (for a Shift_JIS page):
var concat = require('concat-stream'),
    Iconv = require('iconv').Iconv,
    request = require('request');
var conv = new Iconv('Shift_JIS', 'utf8'),
    req = request('http://www.alc.co.jp/');
req.pipe(conv);
req.on('error', function() {
  console.log('an error occurred');
});
conv.pipe(concat(function(body) {
  console.log(body.toString());
}));
https://github.com/request/request/issues/1080#issuecomment-56172161
Upvotes: 0
Reputation: 5578
Since the binary encoding is deprecated, it seems like a better idea to use iconv-lite and handle the decoding explicitly:
var request = require("request"), iconv = require('iconv-lite');
var requestOptions = { encoding: null, method: "GET", uri: "http://something.com"};
request(requestOptions, function(error, response, body) {
var utf8String = iconv.decode(body, "ISO-8859-1"); // body is already a Buffer because encoding is null
console.log(utf8String);
});
The important part is to set encoding: null on the HTTP request, so that body arrives as a raw Buffer instead of a string that has already been decoded with the wrong charset.
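To see why the raw bytes matter, here is a small offline sketch using only Node's built-in Buffer. The sample text is hypothetical and stands in for a body served as ISO-8859-1; Node calls that charset latin1:

```javascript
// Bytes as an ISO-8859-1 server would send them (hypothetical sample text).
var body = Buffer.from('á é ó ú ê ã', 'latin1');

// Decoding with the wrong charset: each accented byte is invalid UTF-8,
// so it is replaced with U+FFFD, the replacement character.
var wrong = body.toString('utf8');

// Decoding with the charset the bytes were actually written in.
var right = body.toString('latin1');

console.log(wrong);
console.log(right);
```

The built-in latin1 decoder only covers ISO-8859-1. For other charsets, such as Shift_JIS, a library like iconv-lite is still needed.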
Upvotes: 27
Reputation: 8488
Specify the encoding as utf8, not utf-8. Here is a list of possible encodings for a Buffer, from the Node.js documentation:
ascii - for 7 bit ASCII data only. This encoding method is very fast, and will strip the high bit if set.
utf8 - Unicode characters. Many web pages and other document formats use UTF-8.
base64 - Base64 string encoding.
binary - A way of encoding raw binary data into strings by using only the first 8 bits of each character. This encoding method is deprecated and should be avoided in favor of Buffer objects where possible. This encoding will be removed in future versions of Node.
Upvotes: 2
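The set of accepted encoding names depends on the Node.js version, so it may be worth checking what the running version accepts. A minimal sketch using Buffer.isEncoding; the list of labels is just an illustration, and the last entry is a deliberately bogus name:

```javascript
// Labels to probe; 'not-an-encoding' is intentionally invalid.
var labels = ['ascii', 'utf8', 'base64', 'binary', 'not-an-encoding'];

// Map each label to whether this Node.js version recognises it.
var supported = {};
labels.forEach(function (label) {
  supported[label] = Buffer.isEncoding(label);
});

console.log(supported);
```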