Extract all the text from a special type of URL using Python

Question

I am trying to extract all the text from some SEC filings given their URLs. My code works for most URLs, but it fails on one particular type of URL (which seems to be related to XBRL).

url_1, where my code works: https://www.sec.gov/Archives/edgar/data/1044378/000156459020025525/bioc-10q_20200331.htm

url_2, where my code doesn't work: https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm

Here is my code:

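import urllib.request
from bs4 import BeautifulSoup

# url is one of the filing URLs above
with urllib.request.urlopen(url) as response:
    html = response.read()
soup = BeautifulSoup(html, "html.parser")
# Drop tables, scripts, and styles before extracting the text
for table in soup.find_all("table"):
    table.decompose()
for tag in soup(["script", "style"]):
    tag.extract()
text = soup.get_text()
print(text)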

I am new to Python and learned this from some YouTube videos. Could someone help me extract all the text from url_2?

Thank you


Answers (1)
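
The /ix?doc= link in url_2 appears to be the SEC's inline XBRL viewer, which loads the filing with JavaScript, so urlopen most likely receives only the viewer page and not the filing text. The filing itself seems to live at the path given in the doc query parameter, which has the same shape as url_1. A minimal sketch of that conversion follows; the helper name resolve_ix_url is hypothetical, and it assumes every viewer link follows the /ix?doc=<path> pattern seen in url_2.

```python
from urllib.parse import urlsplit, parse_qs, urljoin


def resolve_ix_url(url):
    """Return the plain document URL behind an /ix?doc=... viewer link.

    Hypothetical helper: assumes the viewer link has the form
    https://www.sec.gov/ix?doc=/Archives/... as in url_2.
    URLs that do not match are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path == "/ix":
        doc = parse_qs(parts.query).get("doc")
        if doc:
            # Rebuild the URL from the site root plus the path in "doc"
            return urljoin(f"{parts.scheme}://{parts.netloc}", doc[0])
    return url


url_2 = ("https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/"
         "000156459020020844/sgu-10q_20200331.htm")
plain_url = resolve_ix_url(url_2)
```

You could then pass plain_url to the urlopen code from the question in place of url. If the request is rejected, it may be worth sending a descriptive User-Agent header through urllib.request.Request, since the SEC site may refuse requests that lack one.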
