Reputation: 45
I am using node.js to open a list of web pages and parse their HTML contents.
I supply the URLs inside the script as an array, then call request to retrieve the HTML, which I then parse with Cheerio.
The problem I have is that some web pages do not list their own URL inside the HTML content.
So I want to determine the URL of the page I am parsing from within my request callback.
Since request is asynchronous, I cannot rely on the outer loop (which iterates over the array of URL strings) to get the URL.
Any ideas?
var request = require('request');
var cheerio = require('cheerio');

var requestList = ['https://blahblah.com', 'https://blah2.com'];
for (var i = 0; i < requestList.length; i++) {
  request(requestList[i], function (error, response, html) {
    if (!error && response.statusCode === 200) {
      var $ = cheerio.load(html);
      // ...
      // how can I determine the URL of this html body?
    }
  });
}
Thanks for any suggestions!
Upvotes: 1
Views: 142
Reputation: 10058
You can use Array#forEach instead, and use a closure to capture the URL:
requestList.forEach((url) => {
  request(url, (err, res, html) => {
    console.log(url);
    // rest of code here...
  });
});
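Another option, if you prefer to keep the for loop: declaring the loop index with let gives each iteration its own binding, so each asynchronous callback closes over the correct index. A minimal sketch of the idea (the network call is simulated with setTimeout; no actual request is made):

```javascript
const requestList = ['https://blahblah.com', 'https://blah2.com'];
const seen = [];

for (let i = 0; i < requestList.length; i++) {
  // `let` creates a fresh `i` per iteration, so the async
  // callback below captures the correct index
  setTimeout(() => {
    seen.push(requestList[i]);
  }, 0);
}

setTimeout(() => {
  console.log(seen);
}, 10);
```

Had var been used here instead, every callback would share a single i, which would already equal requestList.length by the time the callbacks run.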
Why does it work?
A closure captures all the references it can reach through its enclosing scopes; it is a function that carries its own memory (of sorts).
For example, the same fix can be written with a plain loop by moving the request into a helper function:
for (var i = 0; i < requestList.length; i++) {
  handleRequest(requestList[i]);
}

function handleRequest(url) {
  // scope a
  request(url, function (error, response, html) {
    // scope b (closure)
    console.log(url);
    // rest of the code
  });
}
Since scope b captures the values it can reach, each callback remembers the url variable that was passed into its own handleRequest call.
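A compact, standalone demonstration of this capture behavior (network calls again simulated with setTimeout): with var and no wrapper function, every callback shares one i, while a function parameter gives each callback its own binding:

```javascript
var urls = ['https://a.com', 'https://b.com'];
var shared = [];
var captured = [];

// broken: all callbacks close over the same `i`;
// by the time they run, i === urls.length, so urls[i] is undefined
for (var i = 0; i < urls.length; i++) {
  setTimeout(function () { shared.push(urls[i]); }, 0);
}

// fixed: the parameter `url` is a fresh binding on every call,
// exactly like the `url` parameter of handleRequest above
urls.forEach(function (url) {
  setTimeout(function () { captured.push(url); }, 0);
});

setTimeout(function () {
  console.log(shared);   // [ undefined, undefined ]
  console.log(captured); // [ 'https://a.com', 'https://b.com' ]
}, 10);
```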
Note that closures can sometimes be dangerous: you can create a memory leak when a closure holds a reference to something outside itself, and that outside object in turn holds a reference to something inside the closure.
Upvotes: 3