Reputation: 83
I want to scrape information using BeautifulSoup and iterate through multiple pages. I know how to do this by writing for page in range(1, 3),
for example, if I want the info on the first 2 pages. However, the information is dynamic and the number of pages will grow over time. So how can I iterate when I don't know the last page? Currently I have the following code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
listing_details = []

for page in range(1, 3):
    response = requests.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={}&pm=1'.format(page), headers=headers)
    listings = BeautifulSoup(response.content, "lxml")
    details = listings.findAll('div', attrs={"data-test": "tile"})
    for detail in details:
        # get property links
        links = detail.findAll('a', href=True)
        for link in links:
            link = "https://www.realestate.co.nz" + link['href']
            listing_details.append([link])

df3 = pd.DataFrame(listing_details, columns=['Link'])
print(df3)
Upvotes: 0
Views: 2416
Reputation: 84465
You can use a while True loop and break when the next-page
element is no longer present (or add in a chosen end page number, whichever comes first):
import requests
from bs4 import BeautifulSoup as bs

page = 0
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    while True:
        page += 1
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={page}&pm=1')
        soup = bs(r.content, 'lxml')
        print(page)  # process the current page here; the last page has listings too
        # stop once there is no link to a further page
        next_page = soup.select_one('[data-test=next-link]')
        if next_page is None:
            break
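To tie this back to the question's goal, a minimal sketch combining the next-link check with the link collection from the question might look like this (same selectors as above; nothing new is assumed about the page structure):

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

listing_details = []
page = 0
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    while True:
        page += 1
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={page}&pm=1')
        soup = bs(r.content, 'lxml')
        # collect the property links on the current page
        for detail in soup.find_all('div', attrs={'data-test': 'tile'}):
            for link in detail.find_all('a', href=True):
                listing_details.append(['https://www.realestate.co.nz' + link['href']])
        # stop once there is no "next" link left to follow
        if soup.select_one('[data-test=next-link]') is None:
            break

df3 = pd.DataFrame(listing_details, columns=['Link'])
print(df3)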
You could also calculate the page count from info provided within a script tag (this also shows the idea of adding in a target number of pages):
import requests, re
import math

target_pages = 3

with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page=1&pm=1')
    # the result counts appear as escaped JSON inside a script tag in the page source
    counts = re.search(r'"totalResults\\":(?P<total>\d+),\\"resultsPerPage\\":(?P<perpage>\d+),', r.text, re.M)
    num_pages = math.ceil(int(counts.group('total')) / int(counts.group('perpage')))
    print(num_pages)
    n = 2
    while n <= min(num_pages, target_pages):
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={n}&pm=1')
        print(n)
        n += 1
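Note that this variant depends on the counts being embedded as escaped JSON in the page source; if the site changes its markup, re.search returns None and the .group() calls will raise, so the next-link check above is the more robust stopping condition.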
Upvotes: 1
Reputation: 322
One solution might be to create a while loop and iterate over the pages, adding 1 to the page number on each loop. When the page content is empty or the status code is 404, break.
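As a rough sketch of that idea, reusing the tile selector from the question (whether the site actually returns a 404 past the last page, rather than an empty results page, is an assumption; the empty-tiles check covers the latter case):

import requests
from bs4 import BeautifulSoup

page = 1
with requests.Session() as s:
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    while True:
        r = s.get(f'https://www.realestate.co.nz/residential/sale/auckland?by=latest&oad=true&page={page}&pm=1')
        if r.status_code == 404:  # hard stop if the site 404s past the last page
            break
        soup = BeautifulSoup(r.content, 'lxml')
        tiles = soup.find_all('div', attrs={'data-test': 'tile'})
        if not tiles:  # no listings on this page: treat as past the end
            break
        print(page, len(tiles))
        page += 1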
Upvotes: 0