Reputation: 674
I have the following function to gather all the prices, but I am having trouble handling pagination. How can I scrape through all the pages without knowing how many pages there are?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    return price
Here is what I tried, but it doesn't seem to work:
for pages in itertools.count(start=1):
    try:
        table = get_data('1').append(table)
    except Exception:
        break
Upvotes: 0
Views: 937
Reputation: 1079
This is a great opportunity for recursion, provided that you do not anticipate more than 1000 pages: Python's default recursion limit is 1000, so deeper call chains raise a RecursionError:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices(page=1, prices=None, depth=0, max_depth=100):
    # Use None instead of a mutable default list, so repeated calls start fresh
    if prices is None:
        prices = []
    if depth >= max_depth:
        return prices
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
    r = requests.get(url)
    # A requests.Response is falsy for 4xx/5xx, so both checks stop on a bad page
    if not r or r.status_code != 200:
        return prices
    soup = BeautifulSoup(r.text, 'html.parser')
    price = soup.find_all('h3', {'class': 'price'})
    price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
    prices.append(price)
    return get_prices(page=page + 1, prices=prices, depth=depth + 1)

prices = get_prices()
So the get_prices function first calls itself with the default parameters. Then it keeps calling itself, appending each page's prices to the prices list, until either the next page fails to return status code 200 or the maximum recursion depth you specified is reached.
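If you do bump into that ceiling, the recursion limit mentioned above can be inspected and raised via the standard library (a small aside, separate from the scraper itself):

import sys

print(sys.getrecursionlimit())  # defaults to 1000 in CPython
sys.setrecursionlimit(2000)     # raise it only if you genuinely need deeper recursion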
Alternatively, if you don't like recursion or you need to query more than 1000 pages at a time, then you could use a simpler, but less interesting, while loop:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_prices():
    prices = []
    page = 1
    while True:
        url = 'https://www.remax.ca/bc/vancouver--real-estate?page={page}'.format(page=page)
        r = requests.get(url)
        # Stop as soon as a request fails or returns a non-200 status
        if not r or r.status_code != 200:
            break
        soup = BeautifulSoup(r.text, 'html.parser')
        price = soup.find_all('h3', {'class': 'price'})
        price = pd.DataFrame([p.text for p in price]).rename(columns={0: 'Price'})
        prices.append(price)
        page += 1
    return prices

prices = get_prices()
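Note that either version returns a list of single-column DataFrames, one per page. To get one combined table you can concatenate them afterwards (a short usage sketch, assuming at least one page was collected):

import pandas as pd

all_prices = pd.concat(prices, ignore_index=True)  # one row per listing across all pages
print(all_prices.head())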
Upvotes: 2
Reputation: 457
Try this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import itertools

def get_data(price, page):
    url = 'https://www.remax.ca/bc/vancouver--real-estate?page=' + page
    html = urlopen(url)  # raises HTTPError for a missing page, which ends the loop below
    soup = BeautifulSoup(html, 'html.parser')
    found = soup.find_all('h3', {'class': 'price'})
    price[page] = pd.DataFrame([p.text for p in found]).rename(columns={0: 'Price'})

price = dict()
for page in itertools.count(start=1):
    try:
        get_data(price, str(page))
    except Exception:
        break
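With get_data filling the dict as sketched above, the per-page frames can then be flattened into one table; pd.concat accepts a mapping and uses its keys as an outer index level:

import pandas as pd

all_prices = pd.concat(price)  # outer index level = page number (as a string)
all_prices = all_prices.reset_index(drop=True)
print(all_prices.head())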
Upvotes: 0
Reputation: 310
Maybe you should change "get_data('1')" to "get_data(str(page))"?
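Applied to the loop from the question, the suggested change would look something like this (a sketch reusing the asker's get_data; note that table must also exist before the first append, which the original attempt did not account for, and the stop condition still relies on get_data eventually raising):

import itertools
import pandas as pd

table = pd.DataFrame()
for page in itertools.count(start=1):
    try:
        # DataFrame.append mirrors the question's code; on newer pandas use pd.concat
        table = get_data(str(page)).append(table)
    except Exception:
        break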
Upvotes: -1