Looping through web pages to webscrape data

Question

I'm trying to loop through Zillow pages and extract data. I know the URL is updated with a new page number after each iteration, but the extracted data is the same as if the URL were still on page 1.

import selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd

next_page='https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/'

num_data1=pd.DataFrame(columns=['name','number'])

browser=webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')

while True:

    page=requests.get(next_page)

    contents=page.content

    soup = BeautifulSoup(contents, 'html.parser')

    number_p=soup.find_all('p', attrs={'class':'ldb-phone-number'},text=True)
    name_p=soup.find_all('p', attrs={'class':'ldb-contact-name'},text=True)

    number_p=pd.DataFrame(number_p,columns=['number'])
    name_p=pd.DataFrame(name_p,columns=['name'])

    num_data=number_p['number'].apply(lambda x: x.text.strip())
    nam_data=name_p['name'].apply(lambda x: x.text.strip())

    number_df=pd.DataFrame(num_data,columns=['number'])
    name_df=pd.DataFrame(nam_data,columns=['name'])

    num_data0=pd.concat([number_df,name_df],axis=1)

    num_data1=num_data1.append(num_data0)

    try:

        button=browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
        next_page=str(browser.current_url)

    except IndexError:

        break

Dean W. · Accepted Answer

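Replace page=requests.get(next_page) with contents = browser.page_source, and drop the contents=page.content line, since page_source is already a string of HTML that BeautifulSoup can parse directly.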

What's happening is that you navigate to the next page in Chrome, but then load that page's URL separately with requests. Zillow redirects that request back to page one, probably because it lacks the cookies or request headers that the browser session sends.
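
Put together, a minimal sketch of the loop might look like the following. It parses whatever page the browser is currently showing instead of re-fetching the URL with requests. The names extract_agents, scrape_all_pages, and max_pages are made up for illustration, and the CSS classes and pagination selector are taken from the question as-is, so they may no longer match Zillow's markup. It also assumes a recent Selenium release, where find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector and a missing element raises NoSuchElementException rather than IndexError.

```python
from bs4 import BeautifulSoup


def extract_agents(html):
    """Return one {'name', 'number'} dict per agent found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    names = soup.find_all('p', attrs={'class': 'ldb-contact-name'})
    numbers = soup.find_all('p', attrs={'class': 'ldb-phone-number'})
    # Pair names and numbers by position; this assumes every listing
    # has both fields, as the question's column-wise concat does.
    return [
        {'name': n.get_text(strip=True), 'number': p.get_text(strip=True)}
        for n, p in zip(names, numbers)
    ]


def scrape_all_pages(start_url, max_pages=50):
    """Walk the pagination in one browser session and collect all rows."""
    # Imported here so extract_agents can be used without a browser.
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    rows = []
    try:
        browser.get(start_url)
        # max_pages is a hypothetical safety cap against endless loops.
        for _ in range(max_pages):
            # Parse the page the browser is showing right now.
            rows.extend(extract_agents(browser.page_source))
            try:
                browser.find_element(
                    By.CSS_SELECTOR,
                    '.zsg-pagination>li.zsg-pagination-next>a').click()
            except NoSuchElementException:
                break  # no "next" link, so this was the last page
    finally:
        browser.quit()
    return rows
```

The returned list can be passed straight to pd.DataFrame(rows) to get the name and number columns. If the next page loads slowly, an explicit wait after the click may be needed before reading page_source again.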
