Sandra

Reputation: 167

Why is the Node.js request module returning weird characters?

I am trying to scrape some websites. All of them work well and return HTML content, except for amazon.com, which returns weird characters:

���$����C����~/�2��!Ҧ�@@
PK���                   ;y������~�R�{t�$�)�؊")  ��N
������S�b��Db���y��D.e��%G~g���ú�6~�zB}}=�9)��.w��`�'D:��&�5�\��}�4;_{�s��a�mXo��P�!oD��]]|��V�9VC����5�Yr���f�BL:bm���93G���u���<�Y�C0�Vim��h�ʕ�@�Sx�P|5e<,��]�5���`S�������h
         �����3���zI����(��$\�

I am just getting started with Node.js and with this project, so my code is very simple:

const request = require('request');

request('https://www.amazon.com/', (error, response, body) => {
    console.log('error:', error); // Print the error if one occurred
    console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
    console.log('body:', body); // Print the HTML for the Amazon homepage.
});

Everything displays correctly in Postman for amazon.com, but not with console.log(). The request package decodes the body as UTF-8 by default, which works for the other websites but apparently not for this one. Thank you!

Upvotes: 2

Views: 2442

Answers (3)

jfriend00

Reputation: 707148

Amazon really, really wants to send you gzipped content and that is perhaps what you are seeing. When I run your exact code in node v12.13.1 on Windows 10 with request v2.88, I do not see the issue you mention.

But, I know from previous experience that Amazon tries to detect if gzip content should be OK and, if it thinks so, it will send it as gzipped. For example, if you add a user-agent header to the request such as "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36", then you will indeed get:

content-encoding: gzip

on the response.headers. In fact, if you do this:

console.log(response.headers)

You can see whether this is what is happening to you.
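To illustrate why a gzipped body shows up as weird characters, here is a minimal offline sketch that uses only Node's built-in zlib module. The helper decodeBody and the simulated headers and body are hypothetical, made up for illustration; they are not part of request. With request itself, getting the raw bytes as a Buffer would also require passing encoding: null.

```javascript
const zlib = require('zlib');

// Hypothetical helper (not part of request): decode a raw body Buffer
// according to the content-encoding response header.
function decodeBody(headers, rawBody) {
    const encoding = (headers['content-encoding'] || '').toLowerCase();
    if (encoding === 'gzip') {
        // Decompress first, then interpret the bytes as UTF-8 text.
        return zlib.gunzipSync(rawBody).toString('utf8');
    }
    // No compression announced: the bytes are already the text.
    return rawBody.toString('utf8');
}

// Simulated server response (assumption: stands in for what Amazon sends).
const html = '<html><body>Hello</body></html>';
const compressed = zlib.gzipSync(Buffer.from(html, 'utf8'));

// Treating the compressed bytes as UTF-8 text produces garbage.
const garbled = compressed.toString('utf8');

// Decompressing according to the header recovers the original HTML.
const decoded = decodeBody({ 'content-encoding': 'gzip' }, compressed);

console.log(garbled === html); // false
console.log(decoded === html); // true
```

If response.headers shows content-encoding: gzip and the body still looks like the output in the question, the body was most likely decoded as text without being decompressed first.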

On my system, when I pass the request option gzip: false, Amazon goes back to sending plain text. If I set gzip: true, then Amazon sends gzip-encoded content, but the request library decodes it properly. So, it seems that the best option is this:

const request = require('request');

request({
    url: 'https://www.amazon.com/',
    gzip: true,
}, (error, response, body) => {
    console.log('error:', error); // Print the error if one occurred
    console.log('statusCode:', response && response.statusCode); // Print the response status code if a response was received
    console.log(response.headers);
    console.log('body:', body); // Print the HTML for the Amazon homepage.
});

I found this in the request doc:

For backwards-compatibility, response compression is not supported by default. To accept gzip-compressed responses, set the gzip option to true. Note that the body data passed through request is automatically decompressed while the response object is unmodified and will contain compressed data if the server sent a compressed response.


FYI, as Ashish mentioned, the request() module has been put into maintenance mode and won't be enhanced in the future. For a new project, it's recommended to pick a newer module that has the functionality you're interested in. There are a lot to choose from. You can see a selection of them here.

A pretty simple one is got() which you would use like this:

const got = require('got');

got('https://www.amazon.com/').then(response => {
    console.log(response.body);
}).catch(err => {
    console.log(err);
});

Note: got() automatically handles gzip decompression from amazon.com just fine.

An option that works in both node.js and in the browser and has found a fair amount of popularity is axios():

const axios = require('axios');

axios('http://www.amazon.com').then(response => {
    console.log(response.data);
}).catch(err => {
    console.log(err);
});

Upvotes: 2

Ashish Modi

Reputation: 7770

I am not sure whether request can handle all these different kinds of encoding. Try a more modern package, as request is no longer maintained (https://github.com/request/request/issues/3142).

This short snippet should do what you are looking for.

const got = require("got");
(async () => {
  const response = await got('https://www.amazon.com/');
  console.log(response.body);
})();

Upvotes: 2

mprather

Reputation: 170

It looks like that is encoded data. Try the following and see if it outputs a string:

console.log('body:', body.toString())
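As a quick offline check of this suggestion, the sketch below assumes the body is gzip-compressed and has already been decoded into a UTF-8 string, which is what request does by default. The simulated body is made up for illustration. Under that assumption, toString() changes nothing, and the original bytes can no longer be rebuilt from the string.

```javascript
const zlib = require('zlib');

// Simulated gzip-compressed body (assumption: stands in for Amazon's response).
const compressed = zlib.gzipSync(Buffer.from('<html></html>', 'utf8'));

// What a callback would receive if the compressed bytes were decoded as UTF-8.
const bodyAsString = compressed.toString('utf8');

// Calling toString() on a string returns the same string.
const again = bodyAsString.toString();
console.log(again === bodyAsString); // true

// Invalid UTF-8 bytes were replaced during decoding, so re-encoding the
// string does not give back the original compressed bytes.
const roundTripped = Buffer.from(bodyAsString, 'utf8');
console.log(roundTripped.equals(compressed)); // false
```

If toString() leaves the output unchanged, the body needs to be decompressed rather than converted, for example by letting the HTTP client handle gzip.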

Upvotes: 0
