Mustard Tiger

Reputation: 3671

Python parse HTML with escape characters

I am trying to scrape data from a website, but the data table is rendered by JavaScript. Rather than using a tool like Selenium to render the page and run the script, I found the script tag where the data is stored and am trying to pull the data directly from there.

Here is the code:

import requests
from bs4 import BeautifulSoup
import json

url = 'https://www.etf.com/SPY'

result = requests.get(url)

c = result.content
html = BeautifulSoup(c, 'html.parser')

script = html.find_all('script')[-22]   #this is the script tag that has the data

script = script.contents

js = script[0]
data = js[31:-2]  #data is the json/dict which has the data
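
The fixed offsets in js[31:-2] break as soon as the text around the JSON changes. A less brittle option, sketched here under the assumption that the script body wraps a single JSON object in some JavaScript call, is to decode from the first "{" and ignore whatever follows. The js string and the "ticker" key below are made up for illustration, and extract_json_object is a hypothetical helper name.

```python
import json

# Hypothetical script body; the real one comes from the page's <script> tag.
js = 'jQuery.extend(Drupal.settings, {"etf_report_from_api": {"ticker": "SPY"}});'


def extract_json_object(text):
    """Decode the first JSON object found in text, ignoring what follows it."""
    start = text.index("{")
    # raw_decode stops at the end of the object instead of failing on trailing text.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


data = extract_json_object(js)
print(data["etf_report_from_api"]["ticker"])
```

Note that this returns an already-decoded dict, so the later json.loads(data) step would no longer be needed.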

Here is a snippet of the contents of data:

[image not available]

s = json.loads(data)

s = s['etf_report_from_api']['modalInfoToActive']['top10Holdings']['data']

s = s[13:-2]

Here is a snippet of what s looks like:

[image not available]

At this point the content looks more like HTML, but the escape characters do not seem to have been unescaped properly.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed(s)  # feed the string to the parser to produce the output below

Here is the output of the parser. It seems to recognize certain tags but identifies others as data because of the formatting issue.

[image not available]

This data is essentially an HTML table, but how can I properly decode/parse it to extract the data contents?

Upvotes: 1

Views: 1306

Answers (1)

cody

Reputation: 11157

It looks to me like you simply need to unescape the \" and \/ sequences in your string s, and then you can successfully parse the markup with bs4:

soup = BeautifulSoup(s.replace(r"\"", '"').replace(r"\/", "/"), "html.parser")

for row in soup.find_all("tr"):
    name, value = row.find_all("td")
    print(f"{name.text}\t{value.text}")

Result:

Microsoft Corporation   3.55%
Apple Inc.  3.31%
Amazon.com, Inc.    3.11%
Facebook, Inc. Class A  1.76%
Berkshire Hathaway Inc. Class B 1.76%
...
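
To try the same idea without hitting the site, here is a minimal standard-library sketch. It swaps bs4 for html.parser (the module the question already uses) and runs on a made-up sample string. The markup shape, including the class attribute, is an assumption; only the two names and percentages are taken from the result above. unescape_markup, CellCollector and parse_rows are hypothetical helper names.

```python
from html.parser import HTMLParser

# Hypothetical sample shaped like the escaped markup in s.
sample = (
    r'<table><tr><td class=\"name\">Microsoft Corporation<\/td><td>3.55%<\/td><\/tr>'
    r'<tr><td class=\"name\">Apple Inc.<\/td><td>3.31%<\/td><\/tr><\/table>'
)


def unescape_markup(text):
    """Undo the backslash escapes on double quotes and forward slashes."""
    return text.replace('\\"', '"').replace('\\/', '/')


class CellCollector(HTMLParser):
    """Collect the text of each <td>, grouped by the enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(text):
    collector = CellCollector()
    collector.feed(unescape_markup(text))
    collector.close()
    return collector.rows


rows = parse_rows(sample)
for name, value in rows:
    print(f"{name}\t{value}")
```

Without the unescape step, the parser never sees a proper closing tag such as </td>, which is consistent with the question's output where some tags are reported as data.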

Upvotes: 2
