Reputation:
I've been working on a scraper project. I've implemented most of it, but I'm stuck on one thing.
First, let me explain the workflow: the scrapers are called in the scraping-service
module, where I wait for the promises returned by those functions to resolve. Data is fetched in the scrapers and passed to the data_functions
object, where the data is merged, validated, and inserted into the DB.
Here is the code:
scraping-service
const olxScraper = require('./scrapers/olx-scraper');
const santScraper = require('./scrapers/sant-scraper');
const data_functions = require('./data-functions/dataF');

let count = 1;

// Call the scrapers we want apartment data from, wait for both,
// then validate the merged result.
Promise.all([
  olxScraper.olxScraper(count),
  santScraper.santScraper(count),
]).then(() => data_functions.validateData(data_functions.mergedApartments));
Here I'm waiting for the promises of these two functions to resolve, then passing the merged data to the validateData method in data_functions.
Here is the scraper:
const axios = require('axios'); // npm package - promise-based HTTP client
const cheerio = require('cheerio'); // npm package - used for server-side web scraping
const data_functions = require('../data-functions/dataF');

// olxScraper function; its count parameter is passed in from scraping-service.
exports.olxScraper = async (count) => {
  // URL where the data is located.
  const url = `https://www.olx.ba/pretraga?vrsta=samoprodaja&kategorija=23&sort_order=desc&kanton=9&sacijenom=sacijenom&stranica=${count}`;
  const olxScrapedData = [];
  try {
    await load_url(url, olxScrapedData); // passing the URL and an empty array
  } catch (error) {
    console.log(error);
  }
};
// Loads the URL and kicks off the process of fetching the raw data.
const load_url = async (url, olxScrapedData) => {
  await axios.get(url).then((response) => {
    const $ = cheerio.load(response.data);
    fetch_raw_html($).each((index, element) => {
      process_single_article($, index, element, olxScrapedData);
    });
    process_fetching_squaremeters(olxScrapedData);
    // if I place data_functions.mergeData(olxScrapedData) here instead, it works
  });
};
// Fetches the raw HTML, but only from the div we want.
const fetch_raw_html = ($) => {
  return $('div[id="rezultatipretrage"] > div')
    .not('div[class="listitem artikal obicniArtikal i index"]')
    .not('div[class="obicniArtikal"]');
};
// All the logic for extracting the data we want from the raw HTML.
const process_single_article = ($, index, element, olxScrapedData) => {
  $('span[class="prekrizenacijena"]').remove();
  const getLink = $(element).find('div[class="naslov"] > a').attr('href');
  const getDescription = $(element).find('div[class="naslov"] > a > p').text();
  const getPrice = $(element)
    .find('div[class="datum"] > span')
    .text()
    .replace(/\.| ?KM$/g, '')
    .replace(' ', '');
  const getPicture = $(element).find('div[class="slika"] > img').attr('src');
  // Build an array of objects holding the scraped data.
  olxScrapedData[index] = {
    id: getLink.substring(27, 35),
    link: getLink,
    description: getDescription,
    price: parseFloat(getPrice),
    picture: getPicture,
  };
};
// Square meters have to be fetched for every single article.
// This function loads every link in the olxScrapedData array and updates
// each apartment object with its square-meter value.
const process_fetching_squaremeters = (olxScrapedData) => {
  const fetchSquaremeters = Promise.all(
    olxScrapedData.map((item) => {
      return axios.get(item.link).then((response) => {
        const $ = cheerio.load(response.data);
        const getSquaremeters = $('div[class="df2 "]')
          .first()
          .text()
          .replace('m2', '')
          .replace(',', '.')
          .split('-')[0];
        item.squaremeters = Math.round(getSquaremeters);
        item.pricepersquaremeter = Math.round(
          parseFloat(item.price) / parseFloat(getSquaremeters)
        );
      });
    })
  );
  fetchSquaremeters.then(() => {
    data_functions.mergeData(olxScrapedData); // Sending the final array to mergeData.
    return olxScrapedData;
  });
};
Now, if I console.log(olxScrapedData) inside the fetchSquaremeters.then, it outputs the scraped apartments, but it doesn't call data_functions.mergeData(olxScrapedData). If I move that call into load_url, the functions are triggered and the data is merged, but without the square-meter fields, and I really need that data.
So my question is: how do I make this work? Do I need to call the function somewhere else?
What I want is for this final olxScrapedData to be sent to mergeData, so that the arrays from the different scrapers get merged into one.
Thanks!
Edit: here is what the other scraper looks like: https://jsfiddle.net/oh03mp8t/. Note that there are no promises in that scraper.
Upvotes: 0
Views: 77
Reputation: 81
Try making the function async: const process_fetching_squaremeters = async (olxScrapedData) ..., and then await fetchSquaremeters.then(..).
James, in the other answer, told you what is happening. You must wait for that promise to resolve in order for everything to execute correctly. If you don't have experience with async/await and promises, I suggest watching some courses on them to really understand what is happening here.
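A minimal sketch of that change (with a hypothetical fakeFetch standing in for the real axios.get calls, since only the promise flow matters here): returning the Promise.all from the function lets its caller wait for the whole batch to finish before merging.

```javascript
// Hypothetical stand-in for axios.get: resolves with a value after a delay.
const fakeFetch = (item) =>
  new Promise((resolve) => setTimeout(() => resolve(item * 2), 10));

// The key change: return the Promise.all so the caller can wait for
// every item to be processed before using the results.
const processAll = (items) => {
  const results = [];
  return Promise.all(
    items.map((item) =>
      fakeFetch(item).then((value) => {
        results.push(value);
      })
    )
  ).then(() => results);
};

// The caller now waits for the whole batch, so nothing is missing here.
processAll([1, 2, 3]).then((results) => {
  console.log(results.length); // 3 — all items were processed
});
```

In the real scraper, the same idea means process_fetching_squaremeters returns (or awaits) fetchSquaremeters, and load_url awaits that call, so mergeData only runs once the square-meter requests are done.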
Upvotes: 1
Reputation: 8106
Are you missing return/await statements inside your promise/async code, especially where the last statement is itself a promise?
Without them, you may simply be kicking off the promise to run at some later time, rather than returning it so that Promise.all() waits for it.
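A small self-contained illustration of that point (hypothetical demo code, not the poster's scraper): when an inner promise is not returned, Promise.all receives undefined and resolves immediately, before the inner work has finished.

```javascript
// Resolves with `value` after `ms` milliseconds.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

const demo = () => {
  const results = [];

  // Broken: the inner promise is started but dropped, so Promise.all
  // gets undefined back and has nothing to wait for.
  const notReturned = () => {
    delay(20, 'late').then((v) => results.push(v)); // fire-and-forget
  };

  // Fixed: returning the promise chains it into Promise.all.
  const returned = () => delay(10, 'on-time').then((v) => results.push(v));

  // Promise.all only waits for promises it actually receives, so 'late'
  // has not been pushed yet when this resolves.
  return Promise.all([notReturned(), returned()]).then(() => results);
};

demo().then((results) => console.log(results)); // ['on-time'] — 'late' is missing
```

This is exactly the shape of the scraper's bug: process_fetching_squaremeters never returns fetchSquaremeters, so nothing upstream waits for the square-meter requests.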
Upvotes: 0