I can't seem to produce a dataframe from a webpage table

I'm not sure where the problem is, but my code does not return the DataFrame I expect from the webpage table. This is my first extraction project and I can't identify the problem.

This is the code:

import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime 

url = 'https://en.wikipedia.org/wiki/List_of_largest_banks#By_market_capitalization'
db_name = 'Banks.db'
table_name = 'Largest_banks'
csv_path = '/home/project/Largest_banks_data.csv'
log_file = '/home/project/code_log.txt'  
table_attribs = {'Bank name': 'Name', 'Market Cap (US$ Billion)': 'MC_USD_Billion'}

###  Task 2 - Extract process

def extract(url, table_attribs):
    # Load the webpage for scraping
    html_page = requests.get(url).text

    # Parse the HTML content of the webpage
    data = BeautifulSoup(html_page, 'html.parser')

    # Find the first table with the 'wikitable' class
    # (assumed here to be the "By market capitalization" table)
    main_table = data.find('table', class_='wikitable')
    if main_table is None:
        return None

    # Note: find_all('tbody', attrs=table_attribs) treated the column mapping
    # as HTML attributes, so it matched nothing. Iterate over the rows instead.
    columns = list(table_attribs.values())
    extracted_data = []
    for row in main_table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
        # Header rows hold only <th> cells, so they yield no <td> and are skipped
        if len(cells) >= len(columns):
            # Assumption: the wanted columns are the last ones in each row
            extracted_data.append(cells[-len(columns):])

    # Use pandas to create a DataFrame from the extracted data
    df = pd.DataFrame(extracted_data, columns=columns)

    return df

# Calling the extract function
df = extract(url, table_attribs)

if df is not None:
    # Print the result DataFrame
    print(df)
else:
    print("Extraction failed.")

Upvotes: 0

Views: 43

Answers (1)

Jack Fleeting

Reputation: 24940

You could just read the page directly into pandas, where html_page is the page text you already get from requests.get(url).text:

tables = pd.read_html(html_page)

This will load 3 dataframes, corresponding to the 3 tables on the page. You can then print (or whatever) each table separately; for example

tables[0] 

will print out the first table ("By market capitalization").
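To get from there to the two columns named in table_attribs, something like the following might work. It is only a sketch: the inline HTML is a made-up stand-in for the real page so that it runs offline, the bank names and numbers are placeholders, and selecting columns by position assumes the table is laid out as rank, bank name, market cap. Note that pd.read_html needs an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for requests.get(url).text, so the sketch runs offline
html_page = """
<table class="wikitable">
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>Example Bank A</td><td>400.5</td></tr>
  <tr><td>2</td><td>Example Bank B</td><td>250.25</td></tr>
</table>
"""

# Wrapping the string in StringIO avoids the deprecation warning that
# newer pandas versions emit when given literal HTML
tables = pd.read_html(StringIO(html_page))

# Assumption: the first table is the one we want, with the bank name and
# market cap in the second and third columns
df = tables[0].iloc[:, [1, 2]].copy()
df.columns = ['Name', 'MC_USD_Billion']

print(df)
```

Selecting by position sidesteps any mismatch between the header text on the live page and the keys in table_attribs; if the headers do match exactly, df.rename(columns=table_attribs) would be an alternative.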

Upvotes: 0
