Reputation: 27
I am trying to extract all the text from some SEC filings given their URLs. My code works for most URLs, but fails on a special type of URL (which seems to be related to XBRL).
url_1, which my code handles: https://www.sec.gov/Archives/edgar/data/1044378/000156459020025525/bioc-10q_20200331.htm
url_2, which my code fails on: https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm
Here is my code:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen(url) as resp:  # don't shadow the url variable
    html = resp.read()
soup = BeautifulSoup(html, "html.parser")
# Drop tables, scripts, and styles before extracting text
for table in soup.find_all("table"):
    table.decompose()
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
print(text)
I am new to Python and learned this from some YouTube videos. Could someone help me extract all the text from url_2?
Thank you.
Upvotes: 1
Views: 549
Reputation: 24930
You are right that url_2 is an iXBRL link. Fortunately, the link to the plain-vanilla filing is hiding right there inside it.
Try this:
url_2 = "https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm"
url_3 = url_2.replace('ix?doc=/', '')
url_3
Output:
'https://www.sec.gov/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm'
Just use url_3 as your target URL.
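Putting the two pieces together, here is a minimal sketch: a helper that strips the `ix?doc=/` viewer prefix, plus the text-extraction logic from your question refactored into a function you can test without hitting the network. The function names (`normalize_sec_url`, `filing_text`) are my own, not anything from the SEC or BeautifulSoup APIs:

```python
import urllib.request
from bs4 import BeautifulSoup

def normalize_sec_url(url):
    """Convert an iXBRL viewer URL to the plain-document URL it wraps."""
    return url.replace("ix?doc=/", "")

def filing_text(html):
    """Extract visible text from filing HTML, dropping tables, scripts, and styles."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        table.decompose()
    for tag in soup(["script", "style"]):
        tag.extract()
    return soup.get_text()

def extract_filing(url):
    """Fetch a filing (viewer URL or plain URL) and return its text."""
    with urllib.request.urlopen(normalize_sec_url(url)) as resp:
        return filing_text(resp.read())
```

Note that the SEC may reject requests without a descriptive `User-Agent` header, so for real use you would build a `urllib.request.Request` with that header set instead of passing the bare URL.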
Upvotes: 3