davidjones533

Reputation: 25

Scrape a table from multiple pages and store in a single dataframe

Problem: a website has c.80 pages, each of which contains a single table that is identically structured. I need to scrape each table and store the results in a single pandas dataframe. The table content is regularly updated, and therefore the exercise will need to be frequently repeated.

I can scrape the table from a single page but am struggling to do it for multiple pages. All of the examples I have found are for URLs that change iteratively, e.g. (www.example.com/page1, /page2 etc), rather than for a specified list of URLs.

I have tried the following for a subset of the URLs (ideally, I would like to read in the URLs from a csv list), but it only seems to scrape the final table into the dataframe (i.e. ZZ).

Apologies if this seems dim; I'm fairly new to Python and have mainly been using pandas for data analysis, reading data in directly from CSV files. Any help would be gratefully appreciated.

How can I read the URLs in from a CSV list (roughly what I have in mind is sketched after my code below)? And why does my current code only keep the final table, rather than all of them?

from bs4 import BeautifulSoup
import requests
import pandas as pd

COLUMNS = ['ID', 'Serial', 'Aircraft', 'Notes']

urls = ['http://www.ukserials.com/results.php?serial=ZR',
'http://www.ukserials.com/results.php?serial=ZT',
'http://www.ukserials.com/results.php?serial=ZZ']
#scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table") # Find the "table" tag in the page
    rows = table.find_all("tr") # Find all the "tr" tags in the table
    cy_data = [] 
    for row in rows:
        cells = row.find_all("td") #  Find all the "td" tags in each row 
        cells = cells[0:4] # Select the correct columns
        cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it

data = pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)
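
For the CSV side of things, I assume something like pandas' read_csv would do it; this is only a rough sketch of what I have in mind (the file name urls.csv and the column header url are just placeholders):

import pandas as pd

# Sketch only: 'urls.csv' and its 'url' column are placeholder names
url_list = pd.read_csv('urls.csv')['url'].tolist()
# ...then loop over url_list instead of the hard-coded urls above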

Upvotes: 1

Views: 2278

Answers (1)

DatHydroGuy

Reputation: 1116

Can you not add each dataframe to a list, and then concatenate the elements of that list right at the end?

...
dataframes = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table") # Find the "table" tag in the page
    rows = table.find_all("tr") # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td") #  Find all the "td" tags in each row
        cells = cells[0:4] # Select the correct columns
        cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it

    dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)) # Build this page's dataframe, dropping the header row (it has no "td" cells)

data = pd.concat(dataframes) # Stack every per-page dataframe into a single dataframe

Note: You might need to specify index offsets for each dataframe (before merging), as seen here: https://pandas.pydata.org/pandas-docs/stable/merging.html
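
For instance, a minimal sketch of that last line using the ignore_index option of pd.concat, which rebuilds a clean 0-based index instead of keeping each page's original row numbers:

data = pd.concat(dataframes, ignore_index=True) # Reset the combined index rather than offsetting each frame manually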

Upvotes: 2
