Node.js scraped content isn't properly decoded (weird question mark characters)

Question

I am scraping content from a certain website with Node.js. The website is in Spanish, so it contains many special characters such as á, é, í, ó, ú.

When I review the content that my script has scraped, the special characters appear as "�" (a question mark inside a black diamond). Searching for a solution, I came across this similar SO question: "Module request how to properly retrieve accented characters? � � �", so I applied the suggested solution: I used iconv.decode(new Buffer(html), "ISO-8859-1"); to try to decode the characters properly. This time, the special characters started appearing as ï¿½

Below is an excerpt of my code:

var request = require('request');
var cheerio = require('cheerio');
var iconv = require('iconv-lite'); // assumed: the module that provides iconv.decode

request('http://www.website.com/foo/bar/', function(err, resp, html) {
  if (err) {
    console.log("Error!");
    return;
  }

  html = iconv.decode(new Buffer(html), "ISO-8859-1");

  var $ = cheerio.load(html);

  $('.x1').each(function() {
    var url = $(this).find('.ee').attr('src');

    if (typeof url !== 'string') {
      return true; // jump to next iteration
    }

    url = url.replace("/fp/", "/fg/");
    console.log("Foto = " + url);

    var textData = $(this).find('.tx').text();
    console.log("textData = " + textData);  // This variable contains the weird characters
  });
});

Any idea on what I'm missing in order to properly scrape content with those special characters á, é, í, ó, ú, ...?

UPDATE: I also tried using "binary" instead of "ISO-8859-1", and the strange characters still appeared as ï¿½

Xar · Accepted Answer

I finally got it working with the "binary" encoding.

I also had to make some changes. From this:

request('http://www.website.com/foo/bar/', function(err, resp, html) {

To this:

  var requestOptions = {
    uri: 'http://www.website.com/foo/bar/',
    encoding: null
  };

  request.get(requestOptions,function(err, resp, html) {

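So basically, I wasn't setting the request encoding option to null. With encoding: null, request returns the body as a raw Buffer instead of decoding it to a string as UTF-8, so iconv.decode receives the original bytes.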
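To illustrate why the encoding option matters, here is a minimal sketch that reproduces both symptoms from the question using only Node's built-in Buffer, with no network request. It assumes the site serves ISO-8859-1 (the source does not confirm the exact charset), and the sample word "año" is made up for the demonstration. Node's 'latin1' encoding stands in for iconv.decode(buffer, "ISO-8859-1").

```javascript
// Assumption: the page is served as ISO-8859-1, where "ñ" is the single byte 0xF1.
const rawBytes = Buffer.from([0x61, 0xf1, 0x6f]); // "año" as sent on the wire

// Default behaviour: the body is decoded as UTF-8 before the callback sees it.
// 0xF1 on its own is not valid UTF-8, so it becomes U+FFFD (the "�" character).
const decodedAsUtf8 = rawBytes.toString('utf8');

// Re-decoding that string (the new Buffer(html) step in the question) is too late:
// U+FFFD is re-encoded as the bytes EF BF BD, which read as "ï¿½" in ISO-8859-1.
const reDecoded = Buffer.from(decodedAsUtf8, 'utf8').toString('latin1');

// With encoding: null the callback receives the raw bytes, which decode correctly.
const decodedFromBytes = rawBytes.toString('latin1');

console.log(decodedAsUtf8);    // a�o
console.log(reDecoded);        // aï¿½o
console.log(decodedFromBytes); // año
```

In the real script, the html argument of the callback would play the role of rawBytes once encoding: null is set, and it could then be passed straight to iconv.decode without wrapping it in a new Buffer.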
