Reputation: 7461
I am learning to use Beautiful Soup to scrape some info from a website. The website has multiple search results pages that I want to scrape.
This is simple, as the URL changes for each page:
website.com/page1
website.com/page2
...
But I don't know in advance how many pages there will be, so I don't want to try to scrape website.com/page13 if there isn't one, or if website.com/page13 just shows the last results page, which may have been website.com/page9.
Is there a way I can stop scraping when I reach the final results page?
Upvotes: 0
Views: 304
Reputation: 1539
Search pages often number or index their results. If the page you are looking at does this, you can stop as soon as you see an index you have already seen.
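A rough sketch of that repeated-index check follows. It assumes a hypothetical `fetch_ids(page)` helper (not part of Beautiful Soup) that downloads the given results page, parses it, and returns the list of result indices found there; `max_pages` is just a safety cap.

```python
def scrape_until_repeat(fetch_ids, max_pages=1000):
    """Collect result indices page by page, stopping once an index repeats.

    fetch_ids is a hypothetical callable: page number -> list of result
    indices on that page (an empty list if the page has no results).
    """
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        ids = fetch_ids(page)
        # Stop on an empty page, or as soon as any index was already seen.
        if not ids or any(i in seen for i in ids):
            break
        seen.update(ids)
        ordered.extend(ids)
    return ordered
```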
Additionally, the results may be paginated with a list of page links at the bottom of the page, and you can tell from that list whether the page you are on is the last one.
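As a minimal sketch of the pagination check, assuming the site marks its "next page" link with something like `<a class="next">` (the selector is a guess and has to be adapted to the real markup):

```python
def has_next_page(soup, selector="a.next"):
    """Return True if the current page links to a following page.

    soup is the BeautifulSoup object for the current page. The default
    CSS selector is an assumption, not something every site uses.
    """
    return soup.select_one(selector) is not None
```

With `soup = BeautifulSoup(html, "html.parser")` built for each fetched page, the scraping loop would move on to the next URL only while `has_next_page(soup)` is true.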
Furthermore, search pages usually display a fixed number of results per page, so in those cases you can assume you are on the last page when a page shows fewer results than that.
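The results-per-page check could look roughly like this. `fetch_results(page)` is a hypothetical helper that returns the parsed results of one page as a list, and `page_size` is the number of results a full page shows, which you would have to read off the site yourself.

```python
def scrape_until_short_page(fetch_results, page_size, max_pages=1000):
    """Collect results until a page holds fewer than page_size items.

    fetch_results is a hypothetical callable: page number -> list of
    results parsed from that page.
    """
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_results(page)
        results.extend(batch)
        # A short (or empty) page is taken to be the last one.
        if len(batch) < page_size:
            break
    return results
```

Note that if the total number of results is an exact multiple of `page_size`, the last real page is full, so this check only stops if the following page comes back empty; on a site that repeats the last page instead, it needs to be combined with one of the other checks.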
Another way to detect repeated pages is to keep the first result of the current page and compare it with the first result of the next page; if they are the same, you are done.
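A small sketch of the first-result comparison, again assuming a hypothetical `fetch_results(page)` helper that returns the parsed results of one page as a list of comparable values (for example titles or links):

```python
def scrape_until_same_first(fetch_results, max_pages=1000):
    """Collect results until a page starts with the same result as the previous page.

    fetch_results is a hypothetical callable: page number -> list of
    results parsed from that page.
    """
    results = []
    previous_first = None
    for page in range(1, max_pages + 1):
        batch = fetch_results(page)
        # An empty page or a repeated first result means we ran past the end.
        if not batch or batch[0] == previous_first:
            break
        results.extend(batch)
        previous_first = batch[0]
    return results
```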
If you can give more detail on the page you are scraping, or on the scope of the problem, I may be able to give additional input.
Upvotes: 1