Reputation: 117
I am trying to extract the text from the Risk Factors section of this 10-K report from the SEC's EDGAR database: https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm
As you can see, I have managed to identify the headings for the Risk Factors section (the section I want to grab all the text from) and the Unresolved Staff Comments section (the section immediately after Risk Factors), but I have been unable to identify/grab all the text between these two headings (i.e. the text of the Risk Factors section).
As you can see below, I have tried the next_sibling method and some other suggestions from SO, but I am still doing it incorrectly.
Code:
import requests
import bs4 as bs
file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("a", text="Risk Factors")[0]
staff_comments_header = soup.find_all("a", text="Unresolved Staff Comments")[0]
risk_factors_text = risk_factors_header.next_sibling
print(risk_factors_text.contents)
Extract of Desired Output (looking for all text in the Risk Factors section):
In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions
The closing of the Merger Transactions is subject to many conditions, including the receipt of approvals from various governmental entities, which may not approve the Merger Transactions, may delay the approvals for, or may impose conditions or restrictions on, jeopardize or delay completion of, or reduce the anticipated benefits of, the Merger Transactions, and if these conditions are not satisfied or waived, the Merger Transactions will not be completed.
The completion of the Merger Transactions is subject to a number of conditions, including, among others, obtaining certain governmental authorizations, consents, orders or other approvals and the absence of any injunction prohibiting the Merger Transactions or any legal requ........
Upvotes: 1
Views: 828
Reputation: 195408
Another solution: you can use .find_previous_sibling() to check whether you're inside the section you want:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm#s8925A97DDFA55204808914F6529AC721'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
out = []
for tag in soup.find('text').find_all(recursive=False):
    # the nearest preceding "Item ..." header table tells us which section this tag belongs to
    prev = tag.find_previous_sibling(lambda t: t.name == 'table' and t.text.startswith('Item'))
    if prev and prev.text.startswith('Item 1A.') and not tag.text.startswith('Item 1B'):
        out.append(tag.text)
# print the section:
print('\n'.join(out))
Prints:
In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions
...
agreed to implement certain measures to protect national security, certain of which may materially and adversely affect our operating results due to increasing the cost of compliance with security measures, and limiting our control over certain U.S. facilities, contracts, personnel, vendor selection, and operations. If we fail to comply with our obligations under the NSA or other agreements, our ability to operate our business may be adversely affected.
Upvotes: 1
Reputation: 24930
I would take a totally different approach from the other answers here, because you are dealing with EDGAR filings, which are terrible as a general matter and especially terrible when it comes to HTML (and, if you are unlucky enough to have to deal with it, XBRL).
So, in order to extract the Risk Factors section, I resort to the method below. It relies on the fact that Risk Factors is always Item 1A and is always (at least in my experience so far) followed by Item 1B, even if, as in this case, Item 1B is "none".
filing = ''
for f in soup.select('font'):
    if f.text is not None and f.text != "Table of Contents":
        filing += f.text + " \n"
print(filing.split('Item 1B')[0].split('Item 1A')[-1])
You lose most of the formatting and, as always, there will be some clean up to do anyway, but it's close enough - in most cases.
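For example, a minimal clean-up pass (just a sketch, assuming all you need is to normalize whitespace in the extracted section) could be:
import re

section = filing.split('Item 1B')[0].split('Item 1A')[-1]
# strip trailing spaces and collapse runs of blank lines (illustrative only)
section = re.sub(r'[ \t]+\n', '\n', section)
section = re.sub(r'\n{3,}', '\n\n', section)
print(section)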
Note that, this being EDGAR, sooner or later you will run into another filing where the text is not in <font> but in some other tag - so you'll have to adapt...
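A rough sketch of one way to adapt (the fallback tag names below are just guesses, not something every filing is guaranteed to use):
# try <font> first, then fall back to other common text containers (assumed tag names)
text_tags = soup.find_all('font') or soup.find_all('p') or soup.find_all('div')
filing = ''
for el in text_tags:
    if el.text and el.text != "Table of Contents":
        filing += el.text + " \n"
print(filing.split('Item 1B')[0].split('Item 1A')[-1])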
Upvotes: 1
Reputation: 84465
Rather ugly, but you could first remove the page numbers and table-of-contents links, then use CSS filtering to select the target header and its subsequent siblings while excluding the stop-point header and everything after it. Requires bs4 4.7.1+.
for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
    unwanted.decompose()  # remove table of contents hyperlinks and page numbers

selector = ','.join(['table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
                     , 'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))' +
                       ' ~ *:not(table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))), ' +
                       'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))) ~ *)'
                     ])
text = '\n'.join([i.text for i in soup.select(selector)])
print(text, end='\n')
Using variables might make code easier to follow:
for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
    unwanted.decompose()  # remove table of contents hyperlinks and page numbers

start_header = 'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
stop_header = 'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments")))'
selector = ','.join([start_header, start_header + f' ~ *:not({stop_header}, {stop_header} ~ *)'])
text = '\n'.join([i.text for i in soup.select(selector)])
print(text, end='\n')
You could of course loop siblings from the target header until the stop header is found, as sketched below.
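A minimal sketch of that sibling-walking variant, reusing the start_header / stop_header selectors from above (and assuming both headers are actually matched):
start = soup.select_one(start_header)
stop = soup.select_one(stop_header)
out = []
for sibling in start.find_next_siblings():
    if sibling is stop:  # stop once the "Unresolved Staff Comments" header table is reached
        break
    out.append(sibling.get_text())
print('\n'.join(out))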
Upvotes: 0
Reputation: 2443
A couple of issues:

- The "Risk Factors" heading you are after is not inside an a tag, but just a font tag (you can always inspect these details in a browser). However, if you try to do soup.find_all("font", text="Risk Factors") you will get 2 results, because the link in the table of contents also has a font tag, so you need to select the second one: soup.find_all("font", text="Risk Factors")[1].
- Likewise for the stop heading (note the trailing space in the search text): soup.find_all("font", text="Unresolved Staff Comments ")[0].
- The heading's font tag sits inside a div, inside a table cell (td), inside a table row (tr), inside a table, so we need to go 4 parent levels up: risk_factors_header.parent.parent.parent.parent.
- From there we can take next_siblings and iterate through all of them until we reach the table that contains the "Unresolved Staff Comments" heading.
- Use get_text() instead of contents to print each sibling's text.

Ok, all together:
import requests
import bs4 as bs
file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("font", text="Risk Factors")[1]
staff_comments_header = soup.find_all("font", text="Unresolved Staff Comments ")[0]
for paragraph in risk_factors_header.parent.parent.parent.parent.next_siblings:
    if paragraph == staff_comments_header.parent.parent.parent.parent:
        break
    print(paragraph.get_text())
Upvotes: 1