user12167490

Reputation: 27

Extract all the text from a special type of URL using Python

I am trying to extract all the text from some SEC filings, given their URLs. My code works for most URLs, but it fails on a special type of URL that seems to be related to XBRL.

url_1, where my code works: https://www.sec.gov/Archives/edgar/data/1044378/000156459020025525/bioc-10q_20200331.htm

url_2, where my code doesn't work: https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm

Here is my code:

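import urllib.request
from bs4 import BeautifulSoup

# url holds the address of the filing to download
with urllib.request.urlopen(url) as response:
    html = response.read()
soup = BeautifulSoup(html, "html.parser")
# Drop tables, scripts, and styles so only the narrative text remains
for table in soup.find_all("table"):
    table.decompose()
for tag in soup(["script", "style"]):
    tag.extract()
text = soup.get_text()
print(text)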

I am new to Python and learned this from some YouTube videos. Could someone help me extract all the text for url_2?

Thank you

Upvotes: 1

Views: 549

Answers (1)

Jack Fleeting

Reputation: 24930

You are right that url_2 is an iXBRL link. Fortunately, the link to the plain vanilla filing is hiding right there: it is what remains once you drop the ix?doc=/ part.

Try this:

url_2 = "https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm"
url_3 = url_2.replace('ix?doc=/', '')
url_3

Output:

'https://www.sec.gov/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm'

Just use this as your target URL.
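If you are looping over a mix of plain and iXBRL links, the same idea can be wrapped in a small helper. The following is only a sketch: plain_filing_url and fetch_html are hypothetical names, not part of any library, and it assumes the viewer link always has the form /ix?doc=<path>. The User-Agent header is also an assumption on my part (SEC's site may reject automated requests that do not identify themselves), so check SEC's current access guidelines before relying on it.

```python
from urllib.parse import urlsplit, parse_qs, urlunsplit
from urllib.request import Request, urlopen


def plain_filing_url(url):
    """Return the plain-HTML URL for an iXBRL viewer link (hypothetical helper).

    URLs that are not viewer links are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path != "/ix":
        return url
    doc = parse_qs(parts.query).get("doc")
    if not doc:
        return url
    # Rebuild the URL from the document path, dropping the query string.
    return urlunsplit((parts.scheme, parts.netloc, doc[0], "", ""))


def fetch_html(url, user_agent):
    """Download a filing as bytes; user_agent is a string you supply (assumption)."""
    request = Request(plain_filing_url(url), headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read()


url_1 = "https://www.sec.gov/Archives/edgar/data/1044378/000156459020025525/bioc-10q_20200331.htm"
url_2 = "https://www.sec.gov/ix?doc=/Archives/edgar/data/1002590/000156459020020844/sgu-10q_20200331.htm"

# No network access here: this only shows the URL conversion.
print(plain_filing_url(url_1))
print(plain_filing_url(url_2))
```

The bytes returned by fetch_html can be passed to BeautifulSoup exactly as in the question. Parsing the query string instead of calling str.replace leaves ordinary URLs untouched, so the same call is safe for both url_1 and url_2.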

Upvotes: 3
