Reputation: 641
This is my attempt at restructuring my code to utilize promises correctly. The whole program is a basic webscraper.
The challenge with this is trying to make sure that lastStep can access both the HTML and the URL of each page so I'm trying to return an object in nextStep()
.
I'm console logging the html and it's being returned correctly but for some reason, the promises are being logged like this: Promise { <pending> }
. Why is this happening and how can I fix it?
Thank you!
//Modules being used:
var cheerio = require('cheerio');
var json2csv = require('json2csv');
var request = require('request');
var moment = require('moment');
var fs = require('fs');
//harcoded url
var url = 'http://shirts4mike.com/';
//url for tshirt pages
var urlSet = new Set();
var remainder;
var tshirtArray = [];
const requestPromise = function(url) {
return new Promise(function(resolve, reject) {
request(url, function(error, response, html) {
if(error) return reject(error);
if(!error && response.statusCode == 200){
return resolve(html);
}
});
});
}
function scrape (url) {
return requestPromise(url)
.then(function(html) {
var $ = cheerio.load(html);
var links = [];
//get all the links
$('a[href*=shirt]').each(function(){
var a = $(this).attr('href');
//add into link array
links.push(url + a);
});
// return array of links
return links;
});
}
function nextStep (arrayOfLinks) {
var promiseArray = [];
for(var i = 0; i < arrayOfLinks.length; i++){
promiseArray.push(requestPromise(arrayOfLinks[i]));
var promises = Promise.all(promiseArray);
console.log(promises);
}
return {arrayOfHtml: promises , arrayOfUrls: arrayOfLinks};
}
function lastStep (obj){
for(var i = 0; i < obj.arrayOfHtml.length; i++){
var $ = cheerio.load(obj.arrayOfHtml[i]);
//if page has a submit it must be a product page
if($('[type=submit]').length !== 0){
//add page to set
urlSet.add(obj.arrayOfUrls[i]);
console.log(obj.arrayOfUrls[i]);
} else if(remainder == undefined) {
//if not a product page, add it to remainder so it another scrape can be performed.
remainder = obj.arrayOfUrls[i];
console.log("remainder: " + remainder);
}
}
}
scrape(url)
.then(nextStep)
.then(lastStep)
.catch(function(err) {
// handle any error from any request here
console.log(err);
});
Upvotes: 6
Views: 11209
Reputation: 113878
You have an unhandled else
in your code:
if(error) return reject(error);
if(!error && response.statusCode == 200){
return resolve(html);
}
Let's rearrange that to be more clear. Due to the return
the code above is exactly identical to this:
if(error) {
reject(error);
}
else if (response.statusCode == 200) {
resolve(html);
}
else {
// keep this promise pending FOREVER!!
}
You haven't handled the final else. Depending on your intention, the minimum change you can make to fix it is:
if(error) return reject(error);
if(!error && response.statusCode == 200){
return resolve(html);
}
reject(new Error('Not code 200'));
or
if(error) return reject(error);
if(!error && response.statusCode == 200){
return resolve(html);
}
resolve(html);
Still, I'd personally rewrite the logic to be clearer (the very fact that you missed the final else
is proof that the code is unclear).
Upvotes: 1
Reputation: 1161
A couple of things you could try. First, in your requestPromise
function, you don't need to return when you call 'resolve()' and reject()
. I don't know if that will make any difference, but you can at least try.
Next, as discussed in the comments, you should change how you reject and resolve the request promise. Most simply:
if(error) {
reject(error);
} else {
resolve(html);
}
Suppose there was no error (errors will only happen with 4xx or 5xx status code), but the status code was not 200? You could get anything in the 2xx or 3xx range and not get an error, in which case your requestPromise
would never be either resolved or rejected. That is sure to cause you issues because all promises have to end up one or the other.
The next issue is in nextStep
. I would refactor as follows:
function nextStep (arrayOfLinks) {
var promiseArray = [];
for(var i = 0; i < arrayOfLinks.length; i++){
promiseArray.push(requestPromise(arrayOfLinks[i]));
}
return Promise.all(promiseArray)
.then(function (arrayOfHtml) {
return {arrayOfHtml: promises , arrayOfUrls: arrayOfLinks};
});
}
With Promise.all
, you want to fill up your array of promises first, and then after you have made all of your asynchronous calls, that's when you call Promise.all(promisesArray)
. The extra then
on the end of all
will take the html that results from your promises array and then return it as a promise along with the arrayOfLinks
to the next step in your promise chain, which in this case is your lastStep
.
If none of this solves your problem, you will need to look back at the status code issue, I have had problems before where the status code was 202, which means that the request was accepted, but the processing of the request wasn't complete. (You can read more about HTTP status code here). It was a very similar situation, we had a bunch of urls that we were making requests to. We ended up putting all urls that got 202 back into a tryAgain
array and then tried hitting them again.
In your case, you have a could solve it a couple of ways. The simplest thing would be to reject the promise for all status codes except for 200, which would be a little strict. Another thing you could do is if there is no error and the status code is not 200, then you could resolve the promise with some special value, or simply the non-200 status code, that would signal you need to try again. Then after nextStep
, you could filter all your results that were resolved with a non-200 code, and try hitting them again. After that, you could then finish up with lastStep
. If you've tried everything else and none of it works, I would try one of these solutions. It would take some effort though.
Hope this helps. Let me know if you have any questions.
Upvotes: 2