Blq56

Reputation: 173

Get links with cheerio issue - NodeJS

To get all links from a webpage in Node.js with cheerio, I use the following code, which works about 90% of the time:

const request = require('request');
const cheerio = require('cheerio');

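const url = 'an URL';
request(url, function (err, resp, body) {
  // Stop early on network errors instead of parsing an undefined body
  if (err) { return console.error(err); }
  const $ = cheerio.load(body);
  // Print the text of every <a> element on the page
  $('a').each(function (i, link) {
    console.log($(link).text());
  });
});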

But for some websites, it doesn't work properly, for example: http://www.sylire.com/ http://www.bernieshoot.fr/

I can't figure out why. Could someone give me hints to solve this issue?

Note that I can get all links for these websites normally in the browser console using:

var link = document.querySelectorAll("a");
for (var i of link){
  console.log(i.text);
}

Regards,

Upvotes: 2

Views: 3735

Answers (1)

Sebastiaan van Arkens

Reputation: 447

It's because of the User-Agent header: you need to send one with your request to tell the site that you are "an actual browser" visiting.

Example that works for me:

const request = require('request');
const cheerio = require('cheerio');

var url = 'http://www.sylire.com/';

var customHeaderRequest = request.defaults({
    headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
});

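customHeaderRequest.get(url, function (err, resp, body) {
  // Stop early on network errors instead of parsing an undefined body
  if (err) { return console.error(err); }
  const $ = cheerio.load(body);
  // Print the text of every <a> element on the page
  $('a').each(function (i, link) {
    console.log($(link).text());
  });
});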
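
If you want to check whether a site is treating your request differently before blaming cheerio, something like the following might help. It is only a rough sketch: buildOptions, diagnose and BROWSER_UA are hypothetical names I made up for illustration (they are not part of request or cheerio), and it assumes a blocked request shows up either as a non-2xx status code or as a body without any <a> tags, which may not hold for every site.

```javascript
// Hypothetical helpers for illustration; not part of request or cheerio.
const BROWSER_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36';

// Build an options object of the shape request(options, callback) accepts.
function buildOptions(url, userAgent = BROWSER_UA) {
  return { url: url, headers: { 'User-Agent': userAgent } };
}

// Summarise a response so that a blocked or empty page is easy to spot.
function diagnose(err, resp, body) {
  if (err) {
    return { ok: false, reason: 'network error: ' + err.message };
  }
  if (!resp) {
    return { ok: false, reason: 'no response object' };
  }
  if (resp.statusCode < 200 || resp.statusCode >= 300) {
    return { ok: false, reason: 'HTTP status ' + resp.statusCode };
  }
  // Rough count of opening <a> tags; a real parser such as cheerio is more reliable.
  const anchors = (String(body || '').match(/<a[\s>]/gi) || []).length;
  if (anchors === 0) {
    return { ok: false, reason: 'status ' + resp.statusCode + ' but no <a> tags in body' };
  }
  return { ok: true, reason: anchors + ' <a> tags found' };
}
```

You could pass buildOptions(url) as the first argument to request and log diagnose(err, resp, body) inside the callback, once with the browser-like User-Agent and once with a different one, to compare what the site sends back.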

Upvotes: 4
