Reputation: 23
I run a scraping site that uses Node.js to fetch articles. I want to load a Chinese website with XMLHttpRequest, and that site declares its encoding with this meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=gbk" />
My own site uses charset UTF-8. Here is my request code:
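const xhr = new XMLHttpRequest();
xhr.open("GET", url, true);
// Attempted fix: declare gbk in a request header (this did not change the result)
xhr.setRequestHeader('Content-Type', 'text/html; charset=gbk');
xhr.onreadystatechange = function () {
  // Only process the DOM once the response has fully arrived
  if (xhr.readyState !== 4) return;
  const $ = cheerio.load(xhr.responseText);
};
xhr.send();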
Does anyone know what I have to set for the header? I tried charset gbk and GB2312, and neither worked. Any help would be great. Thanks.
Upvotes: 2
Views: 447
Reputation: 36
I think you are using node-XMLHttpRequest: https://github.com/driverdan/node-XMLHttpRequest
Its README lists this under "Known Issues / Missing Features": "Local file access may have unexpected results for non-UTF8 files".
So I think this cannot be solved with node-XMLHttpRequest.
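To illustrate why a request header is unlikely to help, here is a minimal sketch. It assumes the client decodes the response bytes as UTF-8 before exposing `responseText` (an assumption about the library, not something confirmed in its README). The variable names are made up for the example.

```javascript
// GBK bytes for the two characters "中文" (0xD6D0 and 0xCEC4).
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4])

// Assumption: this mimics a client that decodes every response as UTF-8.
const wrongText = gbkBytes.toString('utf8')

// The bytes are not valid UTF-8, so they become U+FFFD replacement characters.
const isGarbled = wrongText.includes('\uFFFD')
console.log(isGarbled)
```

Once the text has been decoded this way, the original bytes are gone, so the fix has to happen on the raw response body rather than on the resulting string.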
Here is my solution for scraping a site that uses gbk; I hope it is useful to you.
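const rp = require('request-promise')
const cheerio = require('cheerio')
const iconv = require('iconv-lite')

const options = {
  url: `http://www.duchang.org/`,
  // encoding: null makes request return the body as a raw Buffer
  encoding: null,
  transform: function (body) {
    // Decode the raw gbk bytes into a string before parsing
    let html = iconv.decode(body, 'gbk')
    return cheerio.load(html)
  }
}

rp(options)
  .then(($) => {
    // Homepage headlines
    console.log($)
  })
  .catch(function (err) {
    // Log the failure instead of rethrowing inside the promise chain
    console.error(err)
  })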
Upvotes: 1