user16019366

Scraping: scrape multiple pages in a loop (BeautifulSoup)

I am trying to scrape real estate data using Beautifulsoup, but when I save the result of the scrape to a .csv file, it only contains the information from the first page. I would like to scrape the number of pages I have set in the "pages_number" variable.

# imports used by the snippet
import os
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# how many pages to scrape
pages_number = int(input('How many pages? '))

# start the execution timer
tic = time.time()

# Chromedriver

chromedriver = "./chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)


# initial link
link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1'
driver.get(link)

# creating looping pages

for page in range(1, pages_number + 1):
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")


I already tried this solution but got an error:

link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={}.format(page)'
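
One likely cause of the error: as written, the `.format(page)` call sits inside the quotes, so it is never executed and the literal text ends up in the URL. A minimal corrected sketch of that attempt:

    # hypothetical fix: call .format on the string object, outside the quotes,
    # so the page number is actually interpolated into the URL
    link = 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={}'.format(page)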

Does anyone have any idea what can be done?

COMPLETE CODE

https://github.com/arturlunardi/webscraping_vivareal/blob/main/scrap_vivareal.ipynb

Upvotes: 0

Views: 394

Answers (1)

Anand Gautam

Reputation: 2101

I see that the URL you are using points to page 1 only.

https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1

Are you changing it anywhere in your code? If not, then no matter how many times you fetch, you will only ever get page 1.

You should do something like this:

    for page in range(1, pages_number + 1):
        # launch a fresh browser instance for each page
        chromedriver = "./chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        driver = webdriver.Chrome(chromedriver)

        # build the page-specific link
        link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
        driver.get(link)
        time.sleep(15)
        data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
        driver.close()

Test output (excluding the soup part) for pages_number = 3, with the generated URLs stored in a list for easy viewing:

['https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=2', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=3']

Process finished with exit code 0
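
As a possible refinement (not part of the original answer): starting and closing Chrome on every iteration is slow. A single driver instance can be reused, with only the URL changing per page. A minimal sketch, assuming the same imports and the `pages_number` prompt from the question:

    import os
    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver

    chromedriver = "./chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

    pages_number = int(input('How many pages? '))
    soups = []  # one parsed page per entry

    for page in range(1, pages_number + 1):
        # only the query string changes; the browser stays open
        link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
        driver.get(link)
        time.sleep(15)  # crude wait for the listings to render
        data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
        soups.append(BeautifulSoup(data.encode('utf-8'), "lxml"))

    driver.quit()  # shut the browser down once, at the end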

Upvotes: 1
