Reputation: 88
I am trying to extract the entire text of a filing from a URL such as the one below. I have many URLs, so I need to automate this. I tried every code sample posted here, but they all raise errors, e.g. AttributeError: 'NoneType' object has no attribute 'find_next'. Perhaps a library version has changed since those answers were written, which affects the results.
Here is one link: url = r"https://www.sec.gov/Archives/edgar/data/1166036/000110465904027382/0001104659-04-027382.txt" Can anyone share working Python code? The output should contain the entire text, preferably starting from PART I, or otherwise from Item 1A, all the way to the end.
Here is one example that doesn't run: Extracting text section from (Edgar 10-K filings) HTML
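Update: I ran the following on the SEC data:
import unicodedata
from bs4 import BeautifulSoup as bs
# page is assumed to be the HTTP response already fetched for the URL above
html = bs(page.content, "lxml")
text = html.get_text()
# normalize compatibility characters, then drop anything that is not ASCII
text = unicodedata.normalize("NFKD", text).encode('ascii', 'ignore').decode('utf8')
# collapse all lines into a single line
text = " ".join(text.split("\n"))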
I got the text as well as some junk like the sample below, which might be coming from the tables. Is there a way to filter it out?
<div style=""font-family: 'Times New Roman', Times, serif; font-size: 10pt;""><div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt; font-weight: bold;"">
<div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt; font-weight: bold;"">(4) MORTGAGE NOTES PAYABLE, BANK LINES OF CREDIT AND OTHER LOANS<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><div style=""text-align: justify; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">At October 31, 2018, the Company has mortgage notes payable and other loans that are due in installments over various periods to fiscal 2031. The mortgage loans bear interest rates ranging from 3.5% to 6.6% and are collateralized by real estate investments having a net carrying value of approximately $558.2 million.<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><div style=""text-align: justify; line-height: 11.4pt; font-family: 'Times New Roman', Times, serif; font-size: 10pt;"">Combined aggregate principal maturities of mortgage notes payable during the next five years and thereafter are as follows (in thousands):<div style=""line-height: 11.4pt;""><br style=""line-height: 11.4pt;"" /><table align=""center"" border=""0"" cellpadding=""0"" cellspacing=""0"" style=""width: 80%; font-family: 'Times New Roman', Times, serif; font-size: 10pt;""><td valign=""bottom"" style=""vertical-align: top; padding-bottom: 2px;""> <td colspan=""1"" valign=""bottom"" style=""vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""2"" valign=""bottom"" style=""vertical-align: top; border-bottom: #000000 solid 2px;""><div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">Principal<div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New Roman', Times, serif;"">Repayments<td colspan=""1"" nowrap=""nowrap"" valign=""bottom"" style=""text-align: left; vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""1"" 
valign=""bottom"" style=""vertical-align: bottom; padding-bottom: 2px;""> <td colspan=""2"" valign=""bottom"" style=""vertical-align: top; border-bottom: #000000 solid 2px;""><div style=""text-align: center; line-height: 11.4pt;""><font style=""font-size: 10pt; font-family: 'Times New
Upvotes: 1
Views: 3475
Reputation: 61
Something like this should work. The input is HTML, so parse it with BeautifulSoup and replace block-level tags such as div with line breaks.
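import re
from bs4 import BeautifulSoup

def parse_html(html):
    # html is the raw markup (str or bytes), not an already parsed soup
    elem = BeautifulSoup(html, features="html.parser")
    # fall back to the whole document if there is no <body> element
    elem = elem.body or elem
    text = ""
    for e in elem.descendants:
        if isinstance(e, str):
            # text node: flatten internal newlines
            el = e.strip()
            text += re.sub(r"\n", " ", el)
        elif e.name in ["div", "br", "p", "h1", "h2", "h3", "h4", "tr", "th"]:
            # block-level tags start a new line
            text += "\n"
        elif e.name in ["td", "span"]:
            text += " "
        elif e.name == "li":
            text += "\n- "
    # drop empty lines and trailing whitespace
    text = "\n".join([ll.rstrip() for ll in text.splitlines() if ll.strip()])
    return text

# pass the raw response body; parse_html does the parsing itself
text = parse_html(page.content)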
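If the junk mentioned in the question really does come from tables, one option is to remove those elements before extracting the text. The following is only a minimal sketch under that assumption; the function name strip_tables and the sample markup are made up for illustration, and dropping every table also discards any financial data stored in them.

```python
from bs4 import BeautifulSoup


def strip_tables(html):
    """Return the visible text of html with table, script and style elements removed."""
    soup = BeautifulSoup(html, "html.parser")
    for name in ("table", "script", "style"):
        # Look the tree up again after each removal so that nested
        # elements that were already destroyed are never touched.
        tag = soup.find(name)
        while tag is not None:
            tag.decompose()
            tag = soup.find(name)
    text = soup.get_text(separator="\n")
    # Keep only non-empty lines, without surrounding whitespace.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Illustrative markup, not taken from a real filing.
sample = (
    "<div><p>Notes payable</p>"
    "<table><tr><td>2019</td><td>100</td></tr></table>"
    "<p>End</p></div>"
)
print(strip_tables(sample))
```

If some tables carry text you need, a gentler variant would be to remove only tables whose cells are mostly numeric, but that heuristic needs tuning per filing.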
Extracting specific sections is a more complicated task. You can use headings such as Item 1 and Item 1A as section breaks, but the filings are not standardized, which complicates the task significantly. You can find an example I wrote on GitHub.
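As a rough illustration of using item headings as section breaks, here is a minimal sketch that works on plain text that has already been extracted. The regular expression, the function name split_by_item and the rule of keeping the longest match are all assumptions for this example, not the approach used in the GitHub code; real filings will need more robust heading detection.

```python
import re

# Matches headings such as "Item 1.", "ITEM 1A." or "Item 7A:" at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[ab]?)\s*[.:\-]", re.IGNORECASE | re.MULTILINE
)


def split_by_item(text):
    """Split plain filing text into a dict of {item_number: section_text}.

    An item number often appears twice (table of contents and body),
    so the longest section found for each number is kept.
    """
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from its heading to the next heading (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = match.group(1).upper()
        body = text[match.start():end].strip()
        if len(body) > len(sections.get(key, "")):
            sections[key] = body
    return sections


# Illustrative text, not taken from a real filing.
example = (
    "Item 1. Business\n"
    "Item 1A. Risk Factors\n"
    "PART I\n"
    "Item 1. Business\n"
    "We make cars.\n"
    "Item 1A. Risk Factors\n"
    "Risks are many and varied.\n"
)
for number, body in split_by_item(example).items():
    print(number, "->", repr(body))
```

Keeping the longest candidate is a cheap way to skip table-of-contents entries, but it will fail when an item is only incorporated by reference and therefore very short in the body.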
Upvotes: 0
Reputation: 317
I was also frustrated by this problem, so I wrote sec-parsers. It doesn't always work, but it is free.
Installation:
pip install sec-parsers
pip install edgar # useful for loading sec filings, not necessary
Script:
# Example based on https://stackoverflow.com/questions/71849443/extract-entire-textual-data-from-edgar-10-k-using-python
# SEC Parser
from sec_parsers.sec_parsers import parse_10k
#Dwight Gunning's very helpful edgar package https://github.com/dgunning/edgartools
from edgar import *
# xml
import xml.etree.ElementTree as ET
# sec requires identity to be set to pull from their server
set_identity("John Smith john.smith@example.com")  # placeholder; use your own name and email
# use edgar to get the latest 10-K filing for TSLA
filings = Company("TSLA").get_filings(form="10-K").latest(1)
# get html
html = filings.html()
# parse the sec html document using sec_parsers, visualize = True opens the parsed html in a browser
xml = parse_10k(html, visualize=False)
# sec parsers starts at part I, so we simply need to print the text of the root element
def extract_text(element):
    text = ''
    if element.text is not None:
        text = element.text.strip()
    # recurse into child elements and append their text
    for child in element:
        text += extract_text(child)
    return text
# text of the entire 10-k starting at part I
extract_text(xml.getroot())
sec-parsers also supports granular text extraction, by item and by part (if available):
# prints the file structure of the xml
def print_xml_structure(tree):
    root = tree.getroot()

    def indent(level):
        return " " * level

    def print_element(element, level):
        # print this tag, then its children one level deeper
        print(indent(level) + element.tag)
        for child in element:
            print_element(child, level + 1)

    print_element(root, 0)
I've also added this script as a jupyter notebook on github.
By the way, the link you've posted is an amended 8-K report, not a 10-K.
Upvotes: 1
Reputation: 2049
Your URL points to an amended 8-K filing (i.e. 8-K/A), not a 10-K. 8-K filings have a different structure than 10-Ks. Item 1A does not exist in 8-Ks, and neither do the other items from 1 to 15. I added a complete list of 10-K and 8-K items for comparison below. In other words, even if you manage to get a 10-K extraction algorithm working, it wouldn't work on 8-Ks.
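Since the form type decides which items can exist at all, it may help to check it before running any extraction. EDGAR full-submission .txt files begin with a header that contains a CONFORMED SUBMISSION TYPE line, so a small sketch like the one below can read it. The function name get_form_type is made up, and the sample header is abbreviated and illustrative rather than copied from the filing in the question.

```python
import re

# The header line looks like "CONFORMED SUBMISSION TYPE:<tab>8-K/A".
SUBMISSION_TYPE = re.compile(
    r"^CONFORMED SUBMISSION TYPE:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def get_form_type(submission_text):
    """Return the form type from an EDGAR full-submission header, or None if absent."""
    match = SUBMISSION_TYPE.search(submission_text)
    return match.group(1).strip() if match else None


# Abbreviated, illustrative header (not the real one from the question's URL).
sample_header = (
    "<SEC-HEADER>\n"
    "CONFORMED SUBMISSION TYPE:\t8-K/A\n"
    "PUBLIC DOCUMENT COUNT:\t\t2\n"
    "</SEC-HEADER>\n"
)
form_type = get_form_type(sample_header)
if form_type is None or not form_type.startswith("10-K"):
    print("Skipping 10-K item extraction for form type:", form_type)
```

When looping over many URLs, this check lets you route 10-K, 10-Q and 8-K filings to different extraction logic instead of failing on a missing Item 1A.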
I actually had to solve the same problem (extracting sections from 10-Ks, 10-Qs and 8-Ks) and developed an extraction algorithm covering about 99% of all edge cases. The algorithm is a behemoth and uses many natural language processing strategies.
Here is an example illustrating how to extract item 1A and item 7 from Tesla's 10-K filing. It works for all other items too.
from sec_api import ExtractorApi # https://pypi.org/project/sec-api/
extractorApi = ExtractorApi("YOUR_API_KEY")
# Tesla 10-K filing
filing_url = "https://www.sec.gov/Archives/edgar/data/1318605/000156459021004599/tsla-10k_20201231.htm"
# get the standardized and cleaned text of section 1A "Risk Factors"
section_text = extractorApi.get_section(filing_url, "1A", "text")
# get the original HTML of section 7
# "Management’s Discussion and Analysis of Financial Condition and Results of Operations"
section_html = extractorApi.get_section(filing_url, "7", "html")
Output: section_text[0:1000] includes:
ITEM 1A. RISK FACTORS\n\nYou should carefully consider the risks described below together with the other information set forth in this report, which could materially affect our business, financial condition and future results. The risks described below are not the only risks facing our company. Risks and uncertainties not currently known to us or that we currently deem to be immaterial also may materially adversely affect our business, financial condition and operating results. \n\nRisks Related to Our Ability to Grow Our Business\n\nWe may be impacted by macroeconomic conditions resulting from the global COVID-19 pandemic.\n\nSince the first quarter of 2020, there has been a worldwide impact from the COVID-19 pandemic. Government regulations and shifting social behaviors have limited or closed non-essential transportation, government functions, business activities and person-to-person interactions. In some cases, the relaxation of such trends has recently been followed by actual or...
Upvotes: 1