Reputation: 3671
I am trying to scrape data from a website, but the data table is rendered by JavaScript. Rather than using a tool like Selenium to render the page and run the script, I found the script tag where the data is stored and am trying to pull the data directly from there.
Here is the code:
import requests
from bs4 import BeautifulSoup
import json
url = 'https://www.etf.com/SPY'
result = requests.get(url)
c = result.content
html = BeautifulSoup(c, 'html.parser')
script = html.find_all('script')[-22] #this is the script tag that has the data
script = script.contents
js = script[0]
data = js[31:-2] #data is the json/dict which has the data
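The fixed index `[-22]` and the slice `[31:-2]` both break if the page layout or the script wrapper changes. A minimal sketch of a less brittle alternative, using only the standard library: `json.JSONDecoder().raw_decode` parses one JSON value starting at a given offset and ignores whatever JavaScript follows it. The helper name `extract_json_object` and the sample script text are assumptions for illustration, not the real page content.

```python
import json


def extract_json_object(script_text, start_hint="{"):
    """Decode the first JSON object in script_text, ignoring trailing JavaScript.

    Hypothetical helper: it assumes the first occurrence of start_hint
    marks the beginning of the JSON payload.
    """
    start = script_text.index(start_hint)
    # raw_decode returns (object, end_index); the trailing ");" is left unparsed.
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


# Made-up script body standing in for the real <script> contents.
sample_js = 'jQuery.extend(Drupal.settings, {"etf_report_from_api": {"ticker": "SPY"}});'
settings = extract_json_object(sample_js)
print(settings["etf_report_from_api"]["ticker"])
```

In the same spirit, the script tag could be selected by checking whether its text contains a known key such as `etf_report_from_api`, rather than by its position in the page.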
This is a snippet of what the contents of data looks like:
s = json.loads(data)
s = s['etf_report_from_api']['modalInfoToActive']['top10Holdings']['data']
s = s[13:-2]
Here is a snippet of what s looks like:
At this point the content looks more like HTML, but it seems the escape characters have not been unescaped properly.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()
parser.feed(s)  # feed the extracted string to the parser
Here is the output of the parser. It recognizes certain tags but treats others as data because of the formatting issue.
This data is essentially an HTML table, but how can I properly decode/parse it to extract the data contents?
Upvotes: 1
Views: 1306
Reputation: 11157
It looks to me like you simply need to unescape the \" and \/ sequences in your string s, and then you can successfully parse the markup with bs4:
soup = BeautifulSoup(s.replace(r"\"", '"').replace(r"\/", "/"), "html.parser")
for row in soup.find_all("tr"):
    name, value = row.find_all("td")
    print(f"{name.text}\t{value.text}")
Result:
Microsoft Corporation	3.55%
Apple Inc.	3.31%
Amazon.com, Inc.	3.11%
Facebook, Inc. Class A	1.76%
Berkshire Hathaway Inc. Class B	1.76%
...
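If you would rather stay with the standard library's `HTMLParser` from the question, the same unescaping step makes it work too. This is only a rough sketch: the class name `TableRowParser` and the sample fragment (including the `class` attribute) are made up for illustration, and the real string may be structured differently.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, one tuple of cell texts each
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data


# Made-up fragment with the same \" and \/ escaping described in the question.
escaped = (
    r'<table><tr><td class=\"name\">Microsoft Corporation<\/td><td>3.55%<\/td><\/tr>'
    r'<tr><td class=\"name\">Apple Inc.<\/td><td>3.31%<\/td><\/tr><\/table>'
)
markup = escaped.replace(r'\"', '"').replace(r"\/", "/")

parser = TableRowParser()
parser.feed(markup)
parser.close()
print(parser.rows)
```

Without the two `replace` calls, the parser sees `<\/td>` as text rather than a closing tag, which matches the symptom in the question where some tags were reported as data.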
Upvotes: 2