Reputation: 35
Hello, I am using axios with cheerio to scrape some data. I want to scrape multiple pages; the URL structure is like example.com/?page=1. How can I scrape every single page with a counter?
const axios = require("axios");
const cheerio = require("cheerio");

axios({
  method: "get",
  url: "https://example.com/?page=",
  headers: {
    "User-Agent":
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
  }
}).then(res => {
  const $ = cheerio.load(res.data);
  // ...parsing logic here
});
Upvotes: 2
Views: 2871
Reputation: 183
I believe there are multiple ways to achieve this, but basically you need to execute all the axios requests and parse each response with Cheerio programmatically.
You can create a simple for loop and push the axios promises into an array one by one with the generated URLs, then resolve them all with Promise.all:
const promises = [];

for (let page = 0; page <= 5; page++) {
  promises.push(
    axios({ method: "get", url: `https://example.com?page=${page}` })
      .then(res => {
        // Parse your result with Cheerio or whatever you like
      })
  );
}

// You can pass the parsed results on through this resolve if you want.
Promise.all(promises).then(results => {
  // Every page has been fetched and parsed at this point.
});
Alternatively, you can create an async/recursive function that dispatches the request with axios and conditionally iterates. That way you can also reduce peak memory usage compared with the solution above, although it will be slower because the requests are not made in parallel.
// The function below is a sketch: adapt the stop condition to your
// page's markup (".next-page" is a hypothetical selector).
const dispatchRequest = async (page) => {
  const response = await axios({ url: `https://example.com?page=${page}` });
  // Ex: parse the response here with Cheerio and check whether
  // pagination is still enabled.
  const $ = cheerio.load(response.data);
  if ($(".next-page").length > 0) {
    return dispatchRequest(page + 1);
  } else {
    return response;
  }
};
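To kick the recursion off, you would call it once with the first page (hypothetical usage of the sketch above):

dispatchRequest(0).then(lastResponse => {
  // lastResponse belongs to the final page, the one with no "next" link.
});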
The solutions above have downsides, of course. If you get blocked by the target website or a request somehow fails, you have no chance to retry the same request or to rotate your proxies to bypass the target website's security.
I'd suggest implementing a queue and putting all of the request-dispatch functions there. That way you can detect failures/problems and enqueue the failed requests again. You can also implement both of the solutions above with queue support, run it in parallel, and manage your memory/CPU consumption much better.
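For illustration, here is a minimal sketch of such a queue, assuming axios and cheerio are already required, a fixed limit of 3 attempts per page, and a hypothetical parsePage helper; a real implementation would add back-off and proxy rotation:

const queue = [];
for (let page = 0; page <= 5; page++) {
  queue.push({ page, attempts: 0 });
}

// Hypothetical parser: load the HTML into Cheerio and extract your data.
const parsePage = html => cheerio.load(html);

const worker = async () => {
  while (queue.length > 0) {
    const job = queue.shift();
    try {
      const res = await axios({ url: `https://example.com?page=${job.page}` });
      parsePage(res.data);
    } catch (err) {
      // On failure, enqueue the job again (up to 3 attempts).
      // This is also the place to rotate proxies before retrying.
      if (job.attempts + 1 < 3) {
        queue.push({ page: job.page, attempts: job.attempts + 1 });
      }
    }
  }
};

// Two parallel workers: faster than sequential, gentler than firing everything at once.
Promise.all([worker(), worker()]).then(() => console.log("All pages processed"));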
You can also use SDKs. I saw there are a couple of scraping SDKs that provide this whole toolset, so you won't have to reinvent the wheel.
Upvotes: 3