TheAlmostGreat

Reputation: 19

Unable to scrape data from json page using Python

I'm trying to pull information from this web page (which loads its data through an AJAX call to this page).

I'm able to print out the whole page, but the find_all function just returns a blank list. What am I doing wrong?

from bs4 import BeautifulSoup
import requests

url = "http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919"

def pageText():
    result = requests.get(url)
    doc = BeautifulSoup(result.text, "html.parser")
    return doc


specialNum = pageText()
print(specialNum) 
specialNum = pageText().find_all('literally anything I am trying to pull off of the page')
print(specialNum) #This will always print a blank list

Apologies if this is a stupid question. I'm a bit of a beginner.

Upvotes: 1

Views: 72

Answers (1)

HedgeHog

Reputation: 25048

EDIT

As mentioned by @furas, if you remove the parameter and value callback=jsonp1653673850875 from the URL, the server sends pure JSON, and you can get the HTML directly via r.json()['componentData'].
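A minimal sketch of dropping the callback parameter from the URL before requesting it. The helper name drop_query_param is hypothetical, and it only uses the standard library's urllib.parse; whether the server still answers this URL is an assumption.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_param(url, name):
    """Return url with every query parameter called `name` removed (hypothetical helper)."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as "cur="
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    # safe=":" keeps "XNAS:AAPL" readable instead of encoding the colon
    return urlunsplit(parts._replace(query=urlencode(pairs, safe=":")))


url = (
    "http://financials.morningstar.com/finan/financials/getFinancePart.html"
    "?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US"
    "&cur=&order=asc&_=1653673850919"
)
clean_url = drop_query_param(url, "callback")
print(clean_url)

# With the callback gone, the HTML should be available as described above:
# html = requests.get(clean_url).json()['componentData']
```

Editing the URL string by hand works just as well; the helper is only convenient if you build these URLs for several tickers.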


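The response is not an HTML document but JSONP: a JSON object wrapped in a callback call, with the HTML stored as a string under the componentData key. The simplest approach, in my opinion, is to strip the callback wrapper and convert the remaining JSON string with json.loads() to access the HTML.

From there you can use BeautifulSoup or pandas to scrape the content.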
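To illustrate the unwrapping step without hitting the network, here is a small sketch. The function name unwrap_jsonp and the sample string are made up for the example; the real response is much longer, but it has the same callback(...) shape.

```python
import json


def unwrap_jsonp(text):
    """Decode the JSON inside a 'callback({...})' string (hypothetical helper)."""
    start = text.index("(") + 1   # first "(" opens the callback call
    end = text.rindex(")")        # last ")" closes it
    return json.loads(text[start:end])


# Invented sample in the same shape as the server response
sample = 'jsonp123({"componentData": "<table><tr><td>Revenue</td></tr></table>"})'

payload = unwrap_jsonp(sample)
html = payload["componentData"]
print(html)
```

Using the first "(" and the last ")" matters because the HTML itself may contain parentheses.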

Example beautifulsoup
import json, requests
from bs4 import BeautifulSoup

r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')

# Strip the JSONP wrapper "jsonp...(" and ")" so that only plain JSON remains
payload = json.loads(r.text.split('(', 1)[-1].rsplit(')', 1)[0])

# The HTML table is stored under 'componentData'; name the parser explicitly
soup = BeautifulSoup(payload['componentData'], 'html.parser')

for row in soup.select('table tr'):
    ...
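What goes inside the loop depends on what you need. As one possible sketch, this collects the cell text of every row into a list of lists. The sample HTML is invented for illustration (the "Metric" header is an assumption; the revenue figures are taken from the output shown in this answer), and it stands in for the componentData string.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the HTML found under 'componentData'
sample_html = """
<table>
  <tr><th>Metric</th><th>2020-09</th><th>2021-09</th></tr>
  <tr><th>Revenue USD Mil</th><td>274515</td><td>365817</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for row in soup.select("table tr"):
    # Header cells (th) and data cells (td), in document order
    cells = [cell.get_text(strip=True) for cell in row.select("th, td")]
    rows.append(cells)

print(rows)
```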
Example pandas
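import json, requests
from io import StringIO
import pandas as pd

r = requests.get('http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=jsonp1653673850875&t=XNAS:AAPL&region=usa&culture=en-US&cur=&order=asc&_=1653673850919')

# Strip the JSONP wrapper, then parse the remaining JSON
payload = json.loads(r.text.split('(', 1)[-1].rsplit(')', 1)[0])

# read_html returns a list of DataFrames; StringIO avoids the
# literal-HTML deprecation warning in recent pandas versions
pd.read_html(StringIO(payload['componentData']))[0].dropna()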
Output
Unnamed: 0 2012-09 2013-09 2014-09 2015-09 2016-09 2017-09 2018-09 2019-09 2020-09 2021-09 TTM
Revenue USD Mil 156508 170910 182795 233715 215639 229234 265595 260174 274515 365817 386017
Gross Margin % 43.9 37.6 38.6 40.1 39.1 38.5 38.3 37.8 38.2 41.8 43.3
Operating Income USD Mil 55241 48999 52503 71230 60024 61344 70898 63930 66288 108949 119379
Operating Margin % 35.3 28.7 28.7 30.5 27.8 26.8 26.7 24.6 24.1 29.8 30.9
Net Income USD Mil 41733 37037 39510 53394 45687 48351 59531 55256 57411 94680 101935
Earnings Per Share USD 1.58 1.42 1.61 2.31 2.08 2.3 2.98 2.97 3.28 5.61 6.15
Dividends USD 0.09 0.41 0.45 0.49 0.55 0.6 0.68 0.75 0.8 0.85 0.88
Payout Ratio % * 27.4 28.5 22.3 24.8 26.5 23.7 25.1 23.7 16.3 14.3
Shares Mil 26470 26087 24491 23172 22001 21007 20000 18596 17528 16865 16585
Book Value Per Share * USD 4.25 4.9 5.15 5.63 5.93 6.46 6.04 5.43 4.26 3.91 4.16
Operating Cash Flow USD Mil 50856 53666 59713 81266 65824 63598 77434 69391 80674 104038 116426
Cap Spending USD Mil -9402 -9076 -9813 -11488 -13548 -12795 -13313 -10495 -7309 -11085 -10633
Free Cash Flow USD Mil 41454 44590 49900 69778 52276 50803 64121 58896 73365 92953 105793
Free Cash Flow Per Share * USD 1.58 1.61 1.93 2.96 2.24 2.41 2.88 3.07 4.04 5.57
Working Capital USD Mil 19111 29628 5083 8768 27863 27831 14473 57101 38321 9355

Upvotes: 2
