Reputation: 9866
I'm trying to parse a specification website from saved HTML on my computer. I can post the file upon request.
I'm burnt out trying to figure out why it won't run synchronously. The comments should log the CCCC's first, then the BBBB's, then finally one AAAA.
The code I'm running will not wait at the first hurdle (it prints AAAA... first). Am I using request-promise incorrectly? What is going on?
Is this due to the .each() method of cheerio (I'm assuming it's synchronous)?
const rp = require('request-promise');
const fs = require('fs');
const cheerio = require('cheerio');
async function parseAutodeskSpec(contentsHtmlFile) {
    const topics = [];
    const contentsPage = cheerio.load(fs.readFileSync(contentsHtmlFile).toString());
    const contentsSelector = '.content_htmlbody table td div div#divtreed0e338374 nobr .toc_entry a.treeitem';
    contentsPage(contentsSelector).each(async (idx, topicsAnchor) => {
        const topicsHtml = await rp(topicsAnchor.attribs['href']);
        console.log("topicsHtml.length: ", topicsHtml.length);
    });
    console.log("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
    return topics;
}
Upvotes: 1
Views: 107
Reputation: 9866
Based on the other answers here I came to a rather elegant solution. Note the avoidance of async/await in the .map() callback: cheerio's .each() and .map() (and, from what I've learned about async/await, callback-taking iterators in general) do not wait for the promise that an async callback returns, so an await inside the callback does not pause the outer function:
async function parseAutodeskSpec(contentsHtmlFile) {
    const contentsPage = cheerio.load(fs.readFileSync(contentsHtmlFile).toString());
    const contentsSelector = '.content_htmlbody table td div div#divtreed0e338374 nobr .toc_entry a.treeitem';
    // Start every request up front; .toArray() yields a plain array of promises.
    const contentsReqs = contentsPage(contentsSelector)
        .map((idx, elem) => rp(contentsPage(elem).attr('href')))
        .toArray();
    // Parenthesize the await: a method call binds tighter than await,
    // and a Promise has no .map() method.
    const topicsReqs = (await Promise.all(contentsReqs))
        .map(html => parseAutodeskTopics(html));
    return await Promise.all(topicsReqs);
}
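A detail that is easy to get wrong with this pattern: a method call binds tighter than await, so the awaited Promise.all(...) has to be wrapped in parentheses before .map() is called on the result. Here is a minimal sketch of the difference. fakeFetch is a hypothetical stand-in for rp, not part of request-promise, and the URLs are made up:

```javascript
// Hypothetical stand-in for rp(url): resolves with fake HTML after a short delay.
function fakeFetch(url) {
  return new Promise(resolve => setTimeout(() => resolve(`<html>${url}</html>`), 10));
}

async function withoutParens(urls) {
  const requests = urls.map(url => fakeFetch(url));
  try {
    // Parsed as: await (Promise.all(requests).map(...)).
    // A Promise has no .map method, so this throws before anything is awaited.
    return await Promise.all(requests).map(html => html.length);
  } catch (err) {
    return err.constructor.name;
  }
}

async function withParens(urls) {
  const requests = urls.map(url => fakeFetch(url));
  // Await the combined promise first, then map over the resolved array.
  return (await Promise.all(requests)).map(html => html.length);
}

withoutParens(['a', 'bb']).then(result => console.log(result)); // TypeError
withParens(['a', 'bb']).then(result => console.log(result));    // [ 14, 15 ]
```

Promise.all resolves to an array in the same order as its input, so the lengths line up with the URLs regardless of which request finishes first.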
Upvotes: 1
Reputation: 16344
As @lumio stated in his comment, I also think that this is because the each function is synchronous: it does not wait for the promises returned by an async callback.
You should instead use the map method and call Promise.all() on the result to wait until every promise has resolved:
const promises = contentsPage(contentsSelector).map(async (idx, topicsAnchor) => {
    const topicsHtml = await rp(topicsAnchor.attribs['href']);
    console.log("topicsHtml.length: ", topicsHtml.length);
    const topicsFromPage = await parseAutodeskTopics(topicsHtml);
    console.log("topicsFromPage.length: ", topicsFromPage.length);
    // push mutates topics in place; concat returns a new array that would be discarded
    topics.push(...topicsFromPage);
}).get(); // .get() unwraps the cheerio object into a plain array of promises
await Promise.all(promises);
console.log("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");
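To see the ordering problem without cheerio or any network access, here is a rough sketch that uses Array.prototype.forEach as a stand-in for .each() (an assumption on my part: both call the callback and ignore the promise it returns). The delay helper and the function names are made up for illustration:

```javascript
// Hypothetical helper: resolves after ms milliseconds, simulating a request.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function withForEach(items) {
  const log = [];
  // forEach calls each async callback and ignores the promise it returns.
  items.forEach(async item => {
    await delay(10);
    log.push(`done ${item}`);
  });
  log.push('AAAA'); // runs before any callback has finished
  await delay(50);  // only here so the sketch can observe the late entries
  return log;
}

async function withPromiseAll(items) {
  const log = [];
  // map collects the promises; Promise.all waits for all of them.
  await Promise.all(items.map(async item => {
    await delay(10);
    log.push(`done ${item}`);
  }));
  log.push('AAAA'); // runs only after every callback has finished
  return log;
}

withForEach([1, 2]).then(log => console.log(log));    // [ 'AAAA', 'done 1', 'done 2' ]
withPromiseAll([1, 2]).then(log => console.log(log)); // [ 'done 1', 'done 2', 'AAAA' ]
```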
Upvotes: 1
Reputation: 54984
Try it this way:
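// Collect the hrefs synchronously; .get() returns a plain array.
const hrefs = contentsPage(contentsSelector).map((idx, topicsAnchor) => {
    return topicsAnchor.attribs['href'];
}).get();
let topicsHtml;
// Declare href with const so it does not leak as an implicit global.
for (const href of hrefs) {
    topicsHtml = await rp(href);
    console.log("topicsHtml.length: ", topicsHtml.length);
}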
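Now the await is outside of the map or each callback, where it doesn't work the way you might expect because those methods do not wait for an async callback to finish. In a plain for...of loop, await pauses the enclosing async function, so the requests run one at a time, in order.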
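If it helps to weigh this against the Promise.all approach, here is a small sketch of the trade-off: the loop keeps one request in flight at a time, while Promise.all starts them all at once. makeFakeFetch and its stats object are hypothetical helpers for illustration, standing in for rp:

```javascript
// Hypothetical stand-in for rp(url) that records how many calls overlap.
function makeFakeFetch() {
  const stats = { inFlight: 0, maxInFlight: 0 };
  async function fakeFetch(url) {
    stats.inFlight += 1;
    stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
    await new Promise(resolve => setTimeout(resolve, 10));
    stats.inFlight -= 1;
    return `html for ${url}`;
  }
  return { fakeFetch, stats };
}

// One request at a time: each await finishes before the next call starts.
async function sequential(urls) {
  const { fakeFetch, stats } = makeFakeFetch();
  const pages = [];
  for (const url of urls) {
    pages.push(await fakeFetch(url));
  }
  return { pages, maxInFlight: stats.maxInFlight };
}

// All requests started up front, then awaited together.
async function parallel(urls) {
  const { fakeFetch, stats } = makeFakeFetch();
  const pages = await Promise.all(urls.map(url => fakeFetch(url)));
  return { pages, maxInFlight: stats.maxInFlight };
}

sequential(['a', 'b', 'c']).then(r => console.log(r.maxInFlight)); // 1
parallel(['a', 'b', 'c']).then(r => console.log(r.maxInFlight));   // 3
```

Both versions return the pages in the same order as the input URLs. The sequential loop is gentler on the server but takes roughly the sum of all request times.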
Upvotes: 1