Thatchatorn Thophol

Reputation: 29

Extracting particular text section between tags from HTML

I would like to extract the text of a specific section ("Item 1A") from an HTML file. The text should start at "Item 1A" in the body of the document, not in the table of contents, and stop at "Item 1B". However, "Item 1A" and "Item 1B" each appear several times. How can I identify which occurrence to start from and which to stop at?

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.sec.gov/Archives/edgar/data/1606163/000114420416089184/v434424_10k.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")
text = soup.get_text()

item1a = re.search(r"(item\s1A\.?)(.+)(item\s1B\.?)", text, re.DOTALL | re.IGNORECASE)

item1a.group(2)

The output starts at the first "Item 1A", which is in the table of contents, rather than at the section header.

Thus I want to know:

  1. How can I capture text starting from the "Item 1A" in the body of the document instead of the "Item 1A" in the table of contents?

  2. Why did the match run to the last "Item 1B" instead of stopping at the "Item 1B" in the table of contents?

Upvotes: 2

Views: 151

Answers (1)

Brad Solomon

Reputation: 40948
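
On your second question first: `.+` is greedy, so it consumes as much text as it can and only backtracks far enough to find a trailing "Item 1B", which means the match ends at the last occurrence. Here is a minimal sketch on a made-up string (the `text` below is a hypothetical stand-in for `soup.get_text()`, not the real filing) comparing the greedy pattern with a lazy `.+?`:

```python
import re

# Hypothetical stand-in for soup.get_text(): a table of contents, then the body.
text = (
    "Item 1A. Risk Factors 5\n"
    "Item 1B. Unresolved Staff Comments 20\n"
    "Item 1A. Risk Factors\n"
    "The risks described below are specific to the Company.\n"
    "Item 1B. Unresolved Staff Comments\n"
    "None.\n"
)

flags = re.DOTALL | re.IGNORECASE

# Greedy: runs from the first "Item 1A" to the LAST "Item 1B".
greedy = re.search(r"item\s1A\.?(.+)item\s1B\.?", text, flags)

# Lazy: stops at the first "Item 1B" that follows each "Item 1A".
sections = re.findall(r"item\s1A\.?(.+?)item\s1B\.?", text, flags)

# Heuristic (an assumption, not a rule): the body section is the longest match.
body = max(sections, key=len).strip()
print(body)
```

The lazy pattern yields one match per "Item 1A ... Item 1B" pair, so you still have to decide which match is the real section. Picking the longest one is only a guess about the document layout, which is why working with the HTML structure is more reliable.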

Since you have a soup that helps you work with the structure of the HTML, why not take advantage of that?

One way to phrase this is "find text in between two tags with specific attributes." (The tags representing the 1A and 1B headers.) For that you can pass a callable (a function) to soup.find():

import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
import re

url = "https://www.sec.gov/Archives/edgar/data/1606163/000114420416089184/v434424_10k.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

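def is_pstyle(tag: Tag) -> bool:
    # True for a <p> element that carries an inline style attribute.
    return tag.name == "p" and tag.has_attr("style")

def is_i1a(tag: Tag) -> bool:
    # True for a styled <p> whose text starts with "Item 1A."
    return is_pstyle(tag) and bool(re.match(r"Item 1A\..*", tag.text))

def is_i1b(tag: Tag) -> bool:
    # True for a styled <p> whose text starts with "Item 1B."
    return is_pstyle(tag) and bool(re.match(r"Item 1B\..*", tag.text))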

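def grab_1a_thru_1b(soup: BeautifulSoup) -> str:
    start = soup.find(is_i1a)
    if start is None:
        raise ValueError("no 'Item 1A.' header found")
    def gen_t():
        # Walk the siblings after the 1A header until the 1B header appears.
        for tag in start.next_siblings:
            if is_i1b(tag):
                break
            if hasattr(tag, "get_text"):
                yield tag.get_text()  # or get_text("\n") to separate nested text
            else:
                yield str(tag)
    return "".join(gen_t())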

if __name__ == "__main__":
    print(grab_1a_thru_1b(soup))

First part of output:

The risks and uncertainties described below
are those specific to the Company which we currently believe have the potential to be material, but they may not be the only ones
we face. If any of the following risks, or any other risks and uncertainties that we have not yet identified or that we currently
consider not to be material, actually occur or become material risks, our business, prospects, financial condition, results of
operations and cash flows could be materially and adversely affected. Investors are advised to consider these factors along with
the other information included in this Annual Report and to review any additional risks discussed in our filings with the SEC.
 
Risks Associated with Our Business
 
We are a newly formed company with no operating history and, accordingly, you have no basis on which to evaluate our ability to achieve our business
objective.

You can think of the helper functions is_pstyle, is_i1a, and is_i1b as "filters" that pinpoint the start and end tags. You then iterate over the siblings that follow the start tag until you reach the end tag. (.get_text() works recursively within each sibling tag.)
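
To try the idea without hitting the network, here is a small self-contained sketch. The HTML string is a hypothetical miniature of a filing (I am assuming the table of contents sits in table cells while the section headers are styled `<p>` tags), and `starts_with` is an illustrative helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature filing: TOC entries in <td>, section headers in <p style=...>.
html = """
<table><tr><td>Item 1A. Risk Factors</td><td>5</td></tr>
<tr><td>Item 1B. Unresolved Staff Comments</td><td>20</td></tr></table>
<p style="margin:0">Item 1A. Risk Factors</p>
<p style="margin:0">We are a <b>newly formed</b> company.</p>
<p style="margin:0">Item 1B. Unresolved Staff Comments</p>
<p style="margin:0">None.</p>
"""
soup = BeautifulSoup(html, "html.parser")

def starts_with(node, prefix: str) -> bool:
    # Illustrative filter: a styled <p> whose text begins with the given prefix.
    return (
        isinstance(node, Tag)
        and node.name == "p"
        and node.has_attr("style")
        and node.get_text().startswith(prefix)
    )

# The <td> entries in the table of contents are skipped because they are not <p> tags.
start = soup.find(lambda t: starts_with(t, "Item 1A."))

parts = []
for sib in start.next_siblings:
    if starts_with(sib, "Item 1B."):
        break
    # Tags contribute their nested text; bare strings are kept as-is.
    parts.append(sib.get_text() if isinstance(sib, Tag) else str(sib))

section = "".join(parts).strip()
print(section)
```

Note that the text inside the nested `<b>` tag is included, which is the recursive behavior of .get_text() mentioned above.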

Upvotes: 2
