Brendan Rodgers

Reputation: 305

(Python 3, BeautifulSoup 4) - Scraping Pagination in Div

I can scrape the first page of this site:

http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/1/10

But I am trying to scrape all the other pages by following the "Next" button in the site's pagination.

I have clicked the Next button and can see that the part of the URL that changes goes from 0/1/10 to 0/2/10 for page 2, and so on.
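
So presumably the page URLs follow a template like this (just a sketch; page_num is my own name for the part that changes):

page_num = 2
url = "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/{}/10".format(page_num)  # page 2 of the 10-per-page results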

I have looked at the pagination code and can see that the Next button sits inside a div:

 <div id="pagingNext" class="link-wrapper">
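
I can locate that wrapper easily enough (a quick check; soup comes from my makesoup helper):

paging_div = soup.find("div", id="pagingNext")  # the wrapper div itself
print(paging_div.find("a", href=True))  # no href turns up here, which is my problem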

The issue is that I have only ever scraped pagination successfully on another site, using the code below:

button_next = soup.find("a", {"class": "btn paging-next"}, href=True)
while button_next:
    time.sleep(2)  # pause between requests so we don't get kicked by the server
    soup = makesoup(url="https://www.propertypal.com{0}".format(button_next["href"]))
    button_next = soup.find("a", {"class": "btn paging-next"}, href=True)  # re-find on the new page

This worked, but as the site I am currently scraping doesn't seem to provide an href for the Next button's URL, I am lost as to how to scrape the remaining pages.

I tried:

button_next = soup.find("div", {"class": "paging-Next"})
while button_next:
    time.sleep(2)  # pause between requests so we don't get kicked by the server
    soup = makesoup(url="https://www.propertypal.com{0}".format(button_next))

But it still only scrapes the first page and never moves on to the others.

If anyone can provide help I would be extremely appreciative.

Thanks

Upvotes: 1

Views: 1244

Answers (3)

SIM

Reputation: 22440

The best approach in this case is to exhaust the pages one by one without even knowing how many of them there are, as t.m.adam has already mentioned. Give this a try; it'll give you all the names.

import requests
from bs4 import BeautifulSoup

target_url = "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/{}/10"

page_num = 1
while True:
    response = requests.get(target_url.format(page_num))
    if response.status_code == 404:  # stop once the next page is not found
        break
    print("Scraping Page Number {}".format(page_num))
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.find_all("div", class_="ResultsBusinessName"):
        name = item.find("a").text
        print(name.strip())

    page_num += 1
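
If the server ever responds with 200 and an empty result list instead of a 404 past the last page (an assumption worth testing against this site), breaking on an empty page is a safer stop condition. A minimal variant of the same loop:

import requests
from bs4 import BeautifulSoup

target_url = "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/{}/10"

page_num = 1
while True:
    soup = BeautifulSoup(requests.get(target_url.format(page_num)).text, "lxml")
    names = soup.find_all("div", class_="ResultsBusinessName")
    if not names:  # an empty result list means we have run past the last page
        break
    for item in names:
        print(item.find("a").text.strip())
    page_num += 1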

Upvotes: 1

Vinícius Figueiredo

Reputation: 6508

There's no need to look for the button_next URL, since you already know how the URLs change across the pages. Also, instead of "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/1/10" I'd recommend "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/1/50": the website offers an option to show 50 items at once, so instead of iterating through 4044 pages you only go through 809.

The while loop runs until current reaches 810, because by inspection /809/50 is the last page.

import requests
from bs4 import BeautifulSoup

current = 1
while current < 810:  # the last page, by inspection, is /809/50
    url = "http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/{:d}/50".format(current)
    data = requests.get(url).text
    soup = BeautifulSoup(data, "html.parser")
    print(url)
    #  Do your scraping here
    current += 1

Upvotes: 3

Dmitriy Fialkovskiy

Reputation: 3225

A workaround:

While your Next button check still evaluates to True, you can build the links manually and open them in a loop by incrementing the numeric tail, just as you described: from 0/1/10 to 0/2/10 for page 2, and so on.

something like this:

import urllib.request

base_url = 'http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/'  # dropping the trailing 1/10

incr = 0
while button_next:  # your existing Next-button check; remember to refresh it each pass
    incr += 1
    next_url = base_url + str(incr) + '/10'
    page = urllib.request.urlopen(next_url)
    # ... and then the scraping goes here
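
If you'd rather not rely on a hard-coded page count or a stale button_next at all, one way to end the loop is to re-check the pagingNext wrapper from your question on each page and stop once it disappears or holds no link. A sketch, assuming the wrapper is absent or empty on the last page (worth verifying, since the button may be rendered by JavaScript):

import urllib.request
from bs4 import BeautifulSoup

base_url = 'http://ratings.food.gov.uk/enhanced-search/en-GB/%5E/London/Relevance/0/%5E/%5E/0/'

incr = 0
while True:
    incr += 1
    next_url = base_url + str(incr) + '/10'
    soup = BeautifulSoup(urllib.request.urlopen(next_url).read(), 'html.parser')
    # ... scrape this page here ...
    paging_next = soup.find('div', id='pagingNext')
    if paging_next is None or paging_next.find('a') is None:  # assumed marker of the last page
        break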

Upvotes: 3
