Parsing a whole HTML table with Beautiful Soup

Question

I am trying to use Beautiful Soup to scrape an HTML table into pandas.

The url is https://www.investing.com/equities/exxon-mobil-income-statement

I identified the table in the HTML code (id="rrtable"), but I'm stumbling on getting this parsed and into a pandas dataframe.

The website was returning a 403 error, so I first had to set a User-Agent header to get past it.

I'm expecting to see a dataframe with 5 columns and rows of financial data, but instead I just get un-parsed headers and no content. Where is this going wrong?

#!/usr/local/bin/python3

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://www.investing.com/equities/exxon-mobil-income-statement"
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find_all(id="rrtable")

df = pd.DataFrame(table)

print(df)

Any help would be much appreciated!

Thank you

αԋɱҽԃ αмєяιcαη · Accepted Answer

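import requests
import pandas as pd

# A browser-like User-Agent avoids the 403 response mentioned in the question.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'}
r = requests.get(
    "https://www.investing.com/equities/exxon-mobil-income-statement", headers=headers)

# read_html parses every <table> in the page and returns a list of DataFrames;
# index 1 selects the second table found.
df = pd.read_html(r.content)[1]

# Write the result to CSV without the integer row index.
df.to_csv("data.csv", index=False)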
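
As for why the code in the question prints unparsed markup: soup.find_all(id="rrtable") returns a list of Tag objects, and pd.DataFrame does not parse HTML, so it just stores the raw tags. If you would rather stay with Beautiful Soup, the rows and cells have to be extracted explicitly. The following is only a minimal sketch: the SAMPLE_HTML markup and the table_to_dataframe helper are made up for illustration, and the real page's table may be structured differently.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in markup; NOT the actual content of the investing.com page.
SAMPLE_HTML = """
<table id="rrtable">
  <tr><th>Period Ending</th><th>2019</th><th>2018</th></tr>
  <tr><td>Total Revenue</td><td>100</td><td>90</td></tr>
  <tr><td>Gross Profit</td><td>40</td><td>35</td></tr>
</table>
"""


def table_to_dataframe(html, table_id):
    """Illustrative helper: turn the table with the given id into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id=table_id)  # find() returns one Tag (or None), not a list
    if table is None:
        raise ValueError(f"no element with id={table_id!r}")
    # Collect the text of every header/data cell, row by row.
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    header, *body = rows  # assumes the first row holds the column names
    return pd.DataFrame(body, columns=header)


df = table_to_dataframe(SAMPLE_HTML, "rrtable")
print(df)
```

All cell values come back as strings here, so numeric columns would still need converting (for example with pd.to_numeric). With pd.read_html, the attrs parameter (e.g. attrs={"id": "rrtable"}) can be used to select a table by id instead of by position in the returned list.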
