Reputation: 25
Problem: a website has c. 80 pages, each of which contains a single, identically structured table. I need to scrape each table and store the results in a single pandas dataframe. The table content is regularly updated, so the exercise will need to be repeated frequently.
I can scrape the table from a single page but am struggling to do it for multiple pages. All of the examples I have found are for URLs that change iteratively (e.g. www.example.com/page1, /page2, etc.), rather than for a specified list of URLs.
I have tried the following for a subset of the URLs (ideally, I would like to read the URLs in from a csv list), but it only seems to scrape the final table into the dataframe (i.e. the ZZ page).
Apologies if this seems dim; I'm fairly new to Python and have mainly been using pandas for data analysis, reading in directly from csv. Any help would be greatly appreciated.
How can I read the URLs in from a csv list? And why does my current code only capture the final table rather than all of them?
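For reference, I imagine reading the URLs in would look something like the sketch below; the urls.csv filename and the url column header are just placeholders, not my actual file:
import pandas as pd

url_df = pd.read_csv("urls.csv")  # hypothetical one-column csv with a "url" header
urls = url_df["url"].tolist()     # plain Python list of URL strings to loop over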
from bs4 import BeautifulSoup
import requests
import pandas as pd
COLUMNS = ['ID', 'Serial', 'Aircraft', 'Notes']

urls = ['http://www.ukserials.com/results.php?serial=ZR',
        'http://www.ukserials.com/results.php?serial=ZT',
        'http://www.ukserials.com/results.php?serial=ZZ']

# Scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")  # Find the "table" tag in the page
    rows = table.find_all("tr")  # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")  # Find all the "td" tags in each row
        cells = cells[0:4]  # Select the correct columns
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    data = pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)
Upvotes: 1
Views: 2278
Reputation: 1116
Can you not append each dataframe to a list, and then concatenate the elements of that list right at the end?
...

dataframes = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table")  # Find the "table" tag in the page
    rows = table.find_all("tr")  # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td")  # Find all the "td" tags in each row
        cells = cells[0:4]  # Select the correct columns
        cy_data.append([cell.text for cell in cells])  # For each "td" tag, get the text inside it
    dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))

data = pd.concat(dataframes)
Note: You might need to specify index offsets for each dataframe (or reset the index) before concatenating, as seen here: https://pandas.pydata.org/pandas-docs/stable/merging.html
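For example, pd.concat can renumber the combined rows for you, which avoids manual offsets:

data = pd.concat(dataframes, ignore_index=True)  # rebuild a clean 0..n-1 index instead of keeping each table's own row labels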
Upvotes: 2