Nick Gordon

Reputation: 47

Scraping multiple pages using beautiful soup

I have the code to scrape the first page, but the URL changes from:

https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html --> https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/2.html

import requests
from bs4 import BeautifulSoup

url = "https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
lists = soup.select("div#simulacion_tabla ul")

for lis in lists:
    title = lis.find('li', class_="col1").text
    location = lis.find('li', class_="col2").text
    province = lis.find('li', class_="col3").text
    info = [title, location, province]

How can I create a loop that would run from page 2 - page 65? Many thanks!

Upvotes: 1

Views: 3100

Answers (2)

Nicola Ballotta

Reputation: 301

First of all, please be sure to format your code correctly so everyone can read it.

This could be a potential solution. It is far from optimized code, but you can take some inspiration from it.

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """ Scrape the given url and return the bs4 ResultSet """
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    table = soup.select("div#simulacion_tabla ul")
    return table

def extract_rows(table):
    """ Extract rows """
    rows = []
    for row in table:
        title = row.find('li', class_="col1").text
        location = row.find('li', class_="col2").text
        province = row.find('li', class_="col3").text
        rows.append([title, location, province])
    return rows
        
big_table = []
index = scrape_page("https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/index.html")
for row in extract_rows(index):
    big_table.append(row)

for x in range(2, 66):
    index = scrape_page("https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/" + str(x) + ".html")
    for row in extract_rows(index):
        big_table.append(row)

print(big_table)
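If you want the results in a file rather than printed, the collected rows can be written out with the standard-library `csv` module. This is a minimal sketch; the file name `granjas.csv` and the sample row are illustrative, and in practice `big_table` would be the list built by the loops above.

```python
import csv

# big_table as produced above: a list of [title, location, province] rows.
# One sample row is used here for illustration.
big_table = [["Granja Ejemplo", "Madrid", "Madrid"]]

with open("granjas.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "location", "province"])  # header row
    writer.writerows(big_table)                         # one CSV line per scraped row
```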

Upvotes: 1

Md. Fazlul Hoque

Reputation: 16189

Here is a working solution:

import requests
from bs4 import BeautifulSoup
for page_no in range(1, 66):  # pages 1 through 65
    url = "https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{page}.html".format(page=page_no)
    #print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")

    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        info = [title, location, province]
        print(info)

Upvotes: 1
