Reputation: 45
I am using node.js to open a list of web pages and parse their HTML contents.
I supply the URLs inside the script as an array, then call request to retrieve the HTML, which I then parse with Cheerio.
The problem I have is that some web pages do not list their own URL inside the HTML content.
So I want to determine the URL of the page I am parsing from within my request callback.
Since request is asynchronous, I cannot rely on the outer loop (which iterates over the array of URL strings) to get the URL.
Any ideas?
var request = require('request');
var cheerio = require('cheerio');

var requestList = ['https://blahblah.com', 'https://blah2.com'];
for (var i = 0; i < requestList.length; i++) {
  request(requestList[i], function (error, response, html) {
    if (!error && response.statusCode === 200) {
      var $ = cheerio.load(html);
      // ...
      // how can I determine the URL of this html body?
    }
  });
}
Thanks for any suggestions!
Upvotes: 1
Views: 142
Reputation: 10058
You can use Array#forEach instead, and use a closure to capture the URL:
requestList.forEach((url) => {
  request(url, (err, res, html) => {
    console.log(url);
    // rest of code here...
  });
});
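Another option, if you prefer to keep the for loop: declaring the loop index with let gives each iteration its own binding, so each asynchronous callback closes over the correct index. A minimal sketch of the idea (the network call is simulated with setTimeout; no actual request is made):

```javascript
const requestList = ['https://blahblah.com', 'https://blah2.com'];
const seen = [];

for (let i = 0; i < requestList.length; i++) {
  // `let` creates a fresh `i` per iteration, so the async
  // callback below captures the correct index
  setTimeout(() => {
    seen.push(requestList[i]);
  }, 0);
}

setTimeout(() => {
  console.log(seen);
}, 10);
```

Had var been used here instead, every callback would share a single i, which would already equal requestList.length by the time the callbacks run.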
Why does it work?
A closure captures all the references it can reach through its enclosing scopes; it is a function that carries its own memory (of sorts).
For example, the same fix can be written with a plain loop by moving the request into a helper function:
for (var i = 0; i < requestList.length; i++) {
  handleRequest(requestList[i]);
}

function handleRequest(url) {
  // scope a
  request(url, function (error, response, html) {
    // scope b (closure)
    console.log(url);
    // rest of the code
  });
}
Since scope b captures the values it can reach, each callback remembers the url variable that was passed into its own handleRequest call.
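A compact, standalone demonstration of this capture behavior (network calls again simulated with setTimeout): with var and no wrapper function, every callback shares one i, while a function parameter gives each callback its own binding:

```javascript
var urls = ['https://a.com', 'https://b.com'];
var shared = [];
var captured = [];

// broken: all callbacks close over the same `i`;
// by the time they run, i === urls.length, so urls[i] is undefined
for (var i = 0; i < urls.length; i++) {
  setTimeout(function () { shared.push(urls[i]); }, 0);
}

// fixed: the parameter `url` is a fresh binding on every call,
// exactly like the `url` parameter of handleRequest above
urls.forEach(function (url) {
  setTimeout(function () { captured.push(url); }, 0);
});

setTimeout(function () {
  console.log(shared);   // [ undefined, undefined ]
  console.log(captured); // [ 'https://a.com', 'https://b.com' ]
}, 10);
```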
Note that closures can sometimes be dangerous: you can create a memory leak when a closure holds a reference to something outside itself, and that outside object in turn holds a reference to something inside the closure.
Upvotes: 3