Reputation: 23
I've ben working on making a simple web scraper to get a deeper understanding of using async/await and promise/resolve functionality. I'm having an issue with an async/await function not returning my resolve properly. I'm able to console.log the complete dataset within the function, right before the return resolve, but I'm not able to assign a constant and call the function outside the scope of the function itself, and console.log that data (it's returned as undefined).
This is the function that is supposed to return the dataset i'm scraping for. The saleHeaders are being pulled successfully, because I'm able to scrape for the content and assign to it's respective constant. When using the console.log that's commented out below, I get the complete dataset output.
async function getInnerPosts(saleHeaders) {
await Promise.all(saleHeaders.map(async (job) => {
const dataSet = [];
return new Promise(resolve => {
request(job.fullURL, (err, res, html) => {
const $ = cheerio.load(html);
$('.result-info').each((index, element) => {
const postTitle = $(element).children(".result-title").text();
const postDate = $(element).children(".result-date").attr('title');
const postLink = $(element).children("a").attr('href');
const postPrice = $(element).children(".result-meta").children(".result-price").text();
const postLocation = $(element).children(".result-meta").children(".result-hood").text().replace(/[{()}]/g, '');
// gather data to one const
const fetchedData = { postTitle, postDate, postLink, postPrice, postLocation };
dataSet.push(fetchedData);
// console.log(dataSet);
});
return resolve(dataSet);
});
});
}));
}
But when I try to run the function, assigning it to a constant, and then attempt to log that constant, I'm getting undefined without any warning/errors being thrown. I've attempted to re-structure the returning of the dataset and re-writing the complete function from scratch to make sure I didn't miss some minor mistake, but no luck.
async function scrapeData() {
const saleHeaders = await getForSaleHeader();
// Loop through the categories and pull the inner-data from the posts page
const innerPosts = await getInnerPosts(saleHeaders);
console.log(innerPosts);
}
This is the time, and output i'm receiving when running the above-mentioned code sample:
undefined
real 0m31.272s
user 0m33.594s
sys 0m0.322s
EDIT I am calling the whole script to run as well:
// run the script
scrapeData();
Upvotes: 0
Views: 1164
Reputation: 106037
The root of the problem is that you're mixing Promises with async
/await
. Remember that if a function returns a Promise (e.g. return new Promise
or return Promise.all
), then it should not be async
.
A smaller but just as significant problem is that getInnerPosts
doesn't return anything. It begins with await Promise.all(...)
. It should begin with return Promise.all(...)
. Without a return
there, getInnerPosts
returns undefined
, so the function that calls it is essentially just doing await undefined
, which immediately resolves to undefined
.
Let's break this down step by step. First, get most of this code out of the way by pulling the cheerio selector code into its own function:
function getData(element) {
const $el = $(element);
return {
postTitle: $el.children('.result-title').text(),
postDate: $el.children('.result-date').attr('title'),
postLink: $el.children('a').attr('href'),
postPrice: $el.children('.result-meta > .result-price').text(),
postLocation: $el.children('.result-meta > .result-hood').text().replace(/[{()}]/g, ''),
};
}
Next, get rid of the async
/await
s and add a return
before Promise.all
:
function getInnerPosts(saleHeaders) {
return Promise.all(saleHeaders.map(job => {
const dataSet = [];
return new Promise(resolve => {
request(job.fullURL, (err, res, html) => {
const $ = cheerio.load(html);
$('.result-info').each((index, element) => dataSet.push(getData(element)));
resolve(dataSet);
});
});
}));
}
This should be sufficient to make your code work as expected. However, that pyramid structure is not very easy to read. Let's see if we can improve it.
First, you can wrap your call to request
in a function that returns a Promise (alternatively you could use e.g. request-promise-native). Having done that, you can make the callback you pass to saleHeaders.map
an async
function and use await pRequest(...)
inside. Finally, you can eliminate dataSet
entirely by using cheerio's map
instead of calling push
inside an each
loop. The end result looks like this:
function pRequest(url) {
return new Promise((resolve, reject) =>
request(url, (err, res, html) => err ? reject(err) : resolve(html)));
}
function getInnerPosts(saleHeaders) {
return Promise.all(saleHeaders.map(async job => {
const html = await pRequest(job.fullURL);
const $ = cheerio.load(html);
return $('.result-info').map((index, element) => getData(element)));
}));
}
In reply to your comment:
Still trying to wrap my head around the difference between async/await and promises.
Understanding the difference—or, more importantly, the relationship—between async
/await
and promises is not trivial. It requires three or four layers of understanding. You need to understand how asynchronous callbacks (e.g. the function you pass to request
or setTimeout
) work in JS, then you need to understand how promises relate to callbacks, and finally how async
/await
relates to promises. It's a tough nut to crack, and I think most people pick it up just through experience.
If I have any advice, it's to slow down. Think about each part of your code and what it's actually doing. When you write this:
request(url, (err, res, body) => { /* … */ });
…what happens to that second argument (the callback) inside the request
function?
When you write this:
fetch(url)
.then(res => { /* … */ })
.catch (err => { /* … */ });
…what does fetch(url)
return, and what is then
? Why can you write this instead (in an async
function)?
try {
const res = await fetch(url);
// …
} catch (err) {
// …
}
Will it behave the same way, or are there differences?
Slow down and think about what you're writing. Don't add an await
unless you know why you're doing it. Don't add a new Promise
unless you know why you're doing it.
And finally, break your code up into smaller functions. It's easy to lose track of where you need to return
or where you want an await
when your function has a Promise.all
, a new Promise
, and three nested callbacks. It's much easier to tell if a function could be async
if it's only 2–3 lines long.
Is there any advantage of using one over the other?
IMO the main advantage of async
/await
is readability. Promise constructors and nested then
/catch
callbacks work, but they tend to obscure the intent of your code. async
/await
makes your code read more like normal synchronous code (not to mention "await" is an English word that's already familiar) which makes it easier to divine the intent. Of course, you still need Promises. await
makes a function effectively pause, but sometimes there's only one thing you want to defer while the rest of the function executes, so you need then
. And there's no replacement for Promise.all
Upvotes: 1
Reputation: 19967
async function getInnerPosts(saleHeaders) {
// here
return await Promise.all(saleHeaders.map(async (job) => {
const dataSet = [];
return new Promise(resolve => {
You have to return it because you want to further use the value outside the function.
const innerPosts = await getInnerPosts(saleHeaders);
If you don't, where do you think the value of innerPosts
comes from? It's returned from the getInnerPosts
!
Upvotes: 0
Reputation: 1134
The problem is that your return statement only returns inside the map function (map has a return statement but the function itself does not!
Upvotes: 1