cfadr2021

Reputation: 117

How can I get the contents between two tags in a html page using Beautiful Soup?

I am trying to extract the text from the Risk Factors section of this 10K report from the SEC's EDGAR database https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm

As you can see, I have managed to identify the headings for Risk Factors (the section I want to grab all the text from) and Unresolved Staff Comments (the section immediately after Risk Factors), but I have been unable to grab the text between these two headings (i.e. the text of the Risk Factors section).

I have tried the next_sibling method and some other suggestions from SO, but I am still doing it incorrectly.

Code:


import requests
import bs4 as bs

file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("a", text="Risk Factors")[0]
staff_comments_header = soup.find_all("a", text="Unresolved Staff Comments")[0]
risk_factors_text = risk_factors_header.next_sibling

print(risk_factors_text.contents)

Extract of Desired Output (looking for all text in the Risk Factors section):

In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions
The closing of the Merger Transactions is subject to many conditions, including the receipt of approvals from various governmental entities, which may not approve the Merger Transactions, may delay the approvals for, or may impose conditions or restrictions on, jeopardize or delay completion of, or reduce the anticipated benefits of, the Merger Transactions, and if these conditions are not satisfied or waived, the Merger Transactions will not be completed.
The completion of the Merger Transactions is subject to a number of conditions, including, among others, obtaining certain governmental authorizations, consents, orders or other approvals and the absence of any injunction prohibiting the Merger Transactions or any legal requ........

Upvotes: 1

Views: 828

Answers (4)

Andrej Kesely

Reputation: 195408

Another solution. You can use .find_previous_sibling() to check whether you're inside the section you want:

import requests
from bs4 import BeautifulSoup


url = 'https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm#s8925A97DDFA55204808914F6529AC721'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

out = []
for tag in soup.find('text').find_all(recursive=False):
    prev = tag.find_previous_sibling(lambda t: t.name == 'table' and t.text.startswith('Item'))
    if prev and prev.text.startswith('Item 1A.') and not tag.text.startswith('Item 1B'):
        out.append(tag.text)

# print the section:
print('\n'.join(out))

Prints:

In addition to the other information contained in this Annual Report on Form 10-K, the following risk factors should be considered carefully in evaluating us. Our business, financial condition, liquidity or results of operations could be materially adversely affected by any of these risks.
Risks Relating to the Merger Transactions

...


agreed to implement certain measures to protect national security, certain of which may materially and adversely affect our operating results due to increasing the cost of compliance with security measures, and limiting our control over certain U.S. facilities, contracts, personnel, vendor selection, and operations. If we fail to comply with our obligations under the NSA or other agreements, our ability to operate our business may be adversely affected.

Upvotes: 1

Jack Fleeting

Reputation: 24930

I would take a totally different approach from the other answers here, because you are dealing with EDGAR filings, which are terrible as a general matter and especially terrible when it comes to html (and, if you are unlucky enough to have to deal with it, xbrl).

So in order to extract the Risk Factors section I resort to the method below. It relies on the fact that Risk Factors is always Item 1A and is always (at least in my experience so far) followed by Item 1B, even if, as in this case, Item 1B is "none".

filing = ''
for f in soup.select('font'):  # soup as defined in the question
    if f.text is not None and f.text != "Table of Contents":
        filing += f.text + " \n"
print(filing.split('Item 1B')[0].split('Item 1A')[-1])

You lose most of the formatting and, as always, there will be some clean up to do anyway, but it's close enough - in most cases.

Note that, this being EDGAR, sooner or later you will run into another filing where the text is not in <font> but in some other tag - so you'll have to adapt...
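One way to sketch that adaptation is to fall back through a list of candidate tags and use the first one that actually occurs in the document. This is a hypothetical illustration: the tag list and the sample HTML below are made up, not taken from a real filing.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a filing; real EDGAR pages vary in which tag holds the text.
html = '<p>Item 1A. Risk Factors</p><p>Some risk text.</p><p>Item 1B. None.</p>'
soup = BeautifulSoup(html, 'html.parser')

filing = ''
for tag_name in ('font', 'p', 'div'):  # candidate tags, most common first (assumption)
    tags = soup.select(tag_name)
    if tags:  # use the first tag type present in this filing
        for t in tags:
            if t.text and t.text != "Table of Contents":
                filing += t.text + " \n"
        break

result = filing.split('Item 1B')[0].split('Item 1A')[-1]
print(result)
```

The splitting logic is the same as above; only the tag discovery changes, so the same cleanup caveats apply.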

Upvotes: 1

QHarr

Reputation: 84465

Rather ugly, but you could first remove the page numbers and table-of-contents links, then use filtering to exclude the stop-point header and its subsequent siblings from the target header and its subsequent siblings. Requires bs4 4.7.1+.

for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
    unwanted.decompose() #remove table of contents hyperlinks and page numbers


selector = ','.join(['table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
                    ,'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))' + \
                     ' ~ *:not(table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))), ' + \
                     'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments"))) ~ *)' 
           ])

text = '\n'.join([i.text for i in soup.select(selector)])
print(text, end='\n')

Using variables might make code easier to follow:

for unwanted in soup.select('a:contains("Table of Contents"), div:has(+[style="page-break-after:always"])'):
    unwanted.decompose() #remove table of contents hyperlinks and page numbers

start_header = 'table:has(tr:nth-of-type(2):has(font:contains("Risk Factors")))'
stop_header = 'table:has(tr:nth-of-type(2):has(font:contains("Unresolved Staff Comments")))'

selector = ','.join([start_header,start_header + f' ~ *:not({stop_header}, {stop_header} ~ *)'])

text = '\n'.join([i.text for i in soup.select(selector)])
print(text, end='\n')

You could of course loop siblings from target header until stop header found.
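That sibling loop might look something like the sketch below. It uses a minimal stand-in document (the inline HTML is made up, mimicking the filing's structure of header tables with the section text in sibling tags), since the real page's header tables would be located the same way:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up miniature of the filing's layout: each "Item ..." header sits in a
# table, and the section body lives in the tags between the two header tables.
html = """
<text>
<table><tr><td><font>Item 1A. Risk Factors</font></td></tr></table>
<div>First risk paragraph.</div>
<div>Second risk paragraph.</div>
<table><tr><td><font>Item 1B. Unresolved Staff Comments</font></td></tr></table>
<div>Staff comments text.</div>
</text>
"""
soup = BeautifulSoup(html, 'html.parser')

# Locate the header tables that bracket the section.
start = soup.find('font', text='Item 1A. Risk Factors').find_parent('table')
stop = soup.find('font', text='Item 1B. Unresolved Staff Comments').find_parent('table')

section = []
for sibling in start.next_siblings:
    if sibling is stop:  # reached the next section's header - stop collecting
        break
    if isinstance(sibling, Tag):  # skip bare whitespace strings between tags
        text = sibling.get_text().strip()
        if text:
            section.append(text)

section_text = '\n'.join(section)
print(section_text)
```

The identity check against the stop header is what ends the loop, so nothing after Unresolved Staff Comments is collected.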

Upvotes: 0

JoseKilo

Reputation: 2443

A couple of issues:

  • You are selecting the link from the table of contents instead of the header: the header is not an a tag, but just a font tag (you can always inspect these details in a browser). However, if you try to do soup.find_all("font", text="Risk Factors") you will get 2 results because the link from the table of contents also has a font tag, so you would need to select the second one: soup.find_all("font", text="Risk Factors")[1].
  • Similar issue for the second header, but this time something funny happens: the header has an "invisible" space just before the closing tag, although the link from the TOC doesn't, so you would need to select it like this soup.find_all("font", text="Unresolved Staff Comments ")[0].
  • Another issue, the "text in between" is not a sibling (or siblings) of the tree elements that we've selected, but siblings with an ancestor from those elements. If you inspect the page source code, you will see that the headings are included inside a div, inside a table cell (td), inside a table row (tr), inside a table, so we need to go 4 parent levels up: risk_factors_header.parent.parent.parent.parent.
  • Also, there are several siblings that you are interested in, better to use next_siblings and iterate through all of them.
  • Once you've got all of that, you can use the second heading to break the iteration once you reach it.
  • Since you want to get the text only (ignoring all the html tags) you can use get_text() instead of .contents.

Ok, all together:

import requests
import bs4 as bs

file = requests.get('https://www.sec.gov/Archives/edgar/data/101830/000010183019000022/sprintcorp201810-k.htm')
soup = bs.BeautifulSoup(file.content, 'html.parser')
risk_factors_header = soup.find_all("font", text="Risk Factors")[1]
staff_comments_header = soup.find_all("font", text="Unresolved Staff Comments ")[0]

for paragraph in risk_factors_header.parent.parent.parent.parent.next_siblings:
    if paragraph == staff_comments_header.parent.parent.parent.parent:
        break

    print(paragraph.get_text())

Upvotes: 1
