Arthur Law

Reputation: 131

Download Income Statement from web page using BeautifulSoup and convert to Pandas dataframe?

I am trying to grab the Income Statement table of McDonald's Corporation (MCD) from "https://finance.yahoo.com/quote/MCD/financials?p=MCD" using BeautifulSoup. The HTML downloads fine, but the income statement table does not seem to use the typical "tr" and "td" tags. How can I convert the income statement table into a pandas DataFrame?

My code:

import requests
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/quote/MCD/financials?p=MCD"
result = requests.get(url)
result.raise_for_status()
result.encoding = "utf-8"


src = result.content
soup = BeautifulSoup(src, 'lxml')
print(soup)

array = []
for tr_tag in soup.find_all('tr'):
    b_tag = tr_tag.find_all('td')
    array.append(b_tag)
print(array)

Upvotes: 1

Views: 137

Answers (2)

gmdev

Reputation: 3155

"Download Income Statement from web page using BeautifulSoup..."

First, you call soup.find_all('tr'), but there are no tr tags in the income statement table. On the page, each row is a div tag with a specific class, and passing that class to find_all tells the program exactly which elements you want. I used the div class "D(tbr) fi-row Bgc($hoverBgColor):h" because it is the same on every row of the table. You can then use the text attribute to get each row's raw text instead of its HTML.

import requests
from bs4 import BeautifulSoup

url = "https://finance.yahoo.com/quote/MCD/financials?p=MCD"
result = requests.get(url)
result.raise_for_status()
result.encoding = "utf-8"

src = result.content
soup = BeautifulSoup(src, 'lxml')

rows = []
for i in soup.find_all('div',{'class':'D(tbr) fi-row Bgc($hoverBgColor):h'}):
    row = i.text
    rows.append(row)

print(rows)
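
One caveat with .text on a whole row is that the cell values run together into a single string. Below is a minimal sketch of keeping the cells separate with stripped_strings. It runs against a small inline HTML snippet that is only an assumed approximation of the page's markup (the inner div/span layout is hypothetical), so it may need adjusting for the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = """
<div class="D(tbr) fi-row Bgc($hoverBgColor):h">
  <div><span>Total Revenue</span></div>
  <div><span>21,076,500</span></div>
  <div><span>21,025,200</span></div>
</div>
<div class="D(tbr) fi-row Bgc($hoverBgColor):h">
  <div><span>Cost of Revenue</span></div>
  <div><span>9,961,200</span></div>
  <div><span>10,239,200</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
row_class = "D(tbr) fi-row Bgc($hoverBgColor):h"

rows = []
for row_div in soup.find_all("div", {"class": row_class}):
    # stripped_strings yields each text node separately, with whitespace removed
    rows.append(list(row_div.stripped_strings))

print(rows)
```

Each entry in rows is then a list with the label first and the values after it, which is easier to load into a DataFrame than one concatenated string.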

Upvotes: 1

Jack Fleeting

Reputation: 24940

As mentioned in the comments, here's your step 1:

targets = soup.find("div",{'data-reactid':'41'})
rows = []
for target in targets:
    data = target.find_all('span')
    row = []
    for d in data:
        row.append(d.text)
    rows.append(row)
for row in rows:
    print(row)

Output:

['Total Revenue', '21,076,500', '21,025,200', '22,820,400', '24,621,900']
['Cost of Revenue', '9,961,200', '10,239,200', '12,199,600', '14,417,200']
['Gross Profit', '11,115,300', '10,786,000', '10,620,800', '10,204,700']
['Operating Expenses', 'Research Development', 'Selling General and Administrative', '2,229,400', '2,200,200', '2,231,300', '2,384,500', 'Total Operating Expenses', '2,045,500', '2,200,200', '2,231,300', '2,384,500']

etc.
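
For the remaining step of getting such rows into a DataFrame, here is a rough sketch using the first three rows of the output above. The column names are placeholders I made up, since the page's header row is not shown here, and the conversion assumes every value is a plain integer with comma separators:

```python
import pandas as pd

# Rows copied from the output above (first three only).
rows = [
    ['Total Revenue', '21,076,500', '21,025,200', '22,820,400', '24,621,900'],
    ['Cost of Revenue', '9,961,200', '10,239,200', '12,199,600', '14,417,200'],
    ['Gross Profit', '11,115,300', '10,786,000', '10,620,800', '10,204,700'],
]

# Placeholder column names: replace them with the real period labels.
columns = ['line_item', 'period_1', 'period_2', 'period_3', 'period_4']

df = pd.DataFrame(rows, columns=columns).set_index('line_item')

# Strip the thousands separators and convert the text to integers.
df = df.apply(lambda col: col.str.replace(',', '', regex=False).astype(int))

print(df)
```

Rows like the 'Operating Expenses' one above mix several labels and values in a single list, so they would need to be split into one list per line item before being passed to the DataFrame constructor.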

Upvotes: 1
