Reputation: 7940
I am scraping content from a certain website with Node.js. The website is in Spanish, so it contains many special characters such as á, é, í, ó, ú.
When I review the content that my script has scraped, the special characters appear as "�" (a question mark inside a black diamond). Searching for a solution, I came across this similar SO question: Module request how to properly retrieve accented characters? � � � so I applied the suggested solution: I used iconv.decode(new Buffer(html), "ISO-8859-1");
to try to decode the characters properly. This time, the special characters started appearing as �
Below is an excerpt of my code:
var request = require('request');
var iconv = require('iconv-lite'); // assumed: iconv-lite, whose decode(buffer, encoding) matches the call below
var cheerio = require('cheerio');

request('http://www.website.com/foo/bar/', function(err, resp, html) {
    if (err) {
        console.log("Error!");
        return;
    }
    // Attempt to decode the response body as ISO-8859-1
    html = iconv.decode(new Buffer(html), "ISO-8859-1");
    var $ = cheerio.load(html);
    $('.x1').each(function() {
        var url = $(this).find('.ee').attr('src');
        if (typeof url !== 'string') {
            return true; // jump to next iteration
        }
        url = url.replace("/fp/", "/fg/");
        console.log("Foto = " + url);
        var textData = $(this).find('.tx').text();
        console.log("textData = " + textData); // This variable contains the weird characters
    });
});
Any idea what I'm missing in order to properly scrape content with special characters such as á, é, í, ó, ú?
UPDATE:
I also tried using binary instead of ISO-8859-1, and the strange characters started appearing as �
Upvotes: 1
Views: 1126
Reputation: 7940
I finally got it working with the binary format.
I also had to make some changes. From this:
request('http://www.website.com/foo/bar/', function(err, resp, html) {
To this:
var requestOptions = {
    uri: 'http://www.website.com/foo/bar/',
    encoding: null
};
request.get(requestOptions, function(err, resp, html) {
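So basically the problem was that I wasn't setting the request encoding option to null. With encoding: null, request returns the body as a raw Buffer instead of a string that has already been decoded as UTF-8, so the decode step gets to work on the original bytes.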
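To illustrate why this matters, here is a minimal sketch that uses only Node's built-in Buffer, with no network request. It assumes the site serves ISO-8859-1 (Node calls this 'latin1', and 'binary' is an alias for it); the sample word "camión" and the variable names are made up for the example. The point it tries to show is that once the bytes have been decoded as UTF-8, the accented character is already lost, and decoding again afterwards cannot bring it back.

```javascript
// Bytes for "camión" as an ISO-8859-1 server would send them:
// "ó" is the single byte 0xF3 (hypothetical sample data).
const raw = Buffer.from([0x63, 0x61, 0x6d, 0x69, 0xf3, 0x6e]);

// Decoding those bytes as UTF-8 (what happens when the body is turned
// into a string by default): 0xF3 is not valid UTF-8 here, so it is
// replaced with U+FFFD, the "�" character.
const asUtf8 = raw.toString('utf8');

// Trying to repair the string afterwards: the original 0xF3 byte is gone,
// so we only re-interpret the three UTF-8 bytes of U+FFFD as Latin-1.
const repaired = Buffer.from(asUtf8, 'utf8').toString('latin1');

// Decoding the untouched bytes with the right charset gives the real text.
const correct = raw.toString('latin1');

console.log(asUtf8);
console.log(repaired);
console.log(correct);
```

In the real script, the Buffer passed to the callback when encoding is null plays the role of raw here, so it can be handed to the decode call directly.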
Upvotes: 1