I'm using Beautiful Soup to parse corporate financial filings. These 10-K filings are in XML format and were obtained from the Securities and Exchange Commission's EDGAR site. I have downloaded about 50,000 filings and organized them into 14 separate directories by filing year. My script accesses one directory at a time, reads each XML filing in that directory with Beautiful Soup, and extracts some tags (these are XBRL tags that relate to financial statement items).
My problem: When Beautiful Soup reads in a filing, it sometimes skips over one of the namespaces. Then it stores only a very abbreviated version of the file in memory, and does not locate the desired tags.
The part that stumps me is that this problem only occurs when the script reads through multiple directories. When I adjust the script to read just one directory (corresponding to one year), Beautiful Soup reads in the complete file. But when the script iterates over every directory, it begins to skip namespaces after reading about 8,000 filings.
Relevant section of code:
## Read through each directory, from the first through last year. NOTE: Code works fine if I use 'range(2013, 2014)' or similar
for archive_year in range(min(list_of_years), max(list_of_years)+1):
    ## Change path to new directory
    os.chdir("/home/tladika/Data/EDGAR/10K/XBRL/%d" %archive_year)
    ## Read through each XBRL filing in the directory one at a time
    for file in os.listdir():
        ## Restrict to files that are XML format
        if (re.match(r'.+[0-9]{4}\.xml$', file)):
            ## Extract label of each XML filing
            xml_filing_label = re.match(r'(.+)\.xml$', file).group(1)
            ## Open each individual filing
            with open(file, "r") as xml_filing:
                ## Read filing into Beautiful Soup object, with XML parsing
                souped_file = bs(xml_filing.read(), "xml")
                ## Find all tags related to stock compensation expenses, and store them in a list.
                tags_share_expense_gross = souped_file.find_all("AllocatedShareBasedCompensationExpense")
                ## ---> At this point, the list tags_share_expense_gross is occasionally empty, even though I verify by hand that the tag exists in the XML filing
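To replace the hand check with something repeatable, one option is to count the tag with a parser that is independent of Beautiful Soup. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`; the helper name `count_local_name` and the sample document are hypothetical and are not part of the script above.

```python
import xml.etree.ElementTree as ET


def count_local_name(xml_bytes, local_name):
    """Count elements whose tag, ignoring any namespace, equals local_name.

    Hypothetical helper for cross-checking a Beautiful Soup result.
    """
    root = ET.fromstring(xml_bytes)
    count = 0
    for element in root.iter():
        tag = element.tag
        # ElementTree writes namespaced tags as "{uri}LocalName".
        if isinstance(tag, str) and tag.rsplit("}", 1)[-1] == local_name:
            count += 1
    return count


# Made-up miniature document, only to show the call; a real filing is far larger.
sample = (
    b'<xbrl xmlns="http://www.xbrl.org/2003/instance" '
    b'xmlns:us-gaap="http://fasb.org/us-gaap/2012-01-31">'
    b'<us-gaap:AllocatedShareBasedCompensationExpense contextRef="c1">100'
    b'</us-gaap:AllocatedShareBasedCompensationExpense>'
    b'</xbrl>'
)
print(count_local_name(sample, "AllocatedShareBasedCompensationExpense"))
```

If this count is positive for a file read in binary mode while `find_all` returns an empty list for the same file, the file on disk is intact and the tags are being lost at the Beautiful Soup parsing step.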
Output and Evidence of Problem:
I hand-checked several filings for which the mistake occurs. Here is a screenshot of the head of one XML filing, when opened in a text editor: [image: Original Filing]
The full XML filing itself is here: https://www.sec.gov/Archives/edgar/data/899460/000119312513112693/mnkd-20121231.xml
When I read in just one year, the same information is stored in the souped_file object. But here is a screenshot of the entire XML filing (a printout of souped_file) when I iterate over directories: [image: Wrong Filing]
Note the filing is much shorter. The reason seems to be that one namespace is missing (circled in brown in the first image).
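To check which namespace declarations a filing actually contains, without relying on screenshots, the declarations can be listed directly from the raw bytes. This is a minimal sketch, assuming the standard library's `ElementTree.iterparse` with the `start-ns` event; the function name `declared_namespaces` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def declared_namespaces(xml_bytes):
    """Return {prefix: uri} for the xmlns declarations found in the document.

    Hypothetical helper; the default namespace appears under the empty prefix ''.
    If a prefix is declared more than once, the first declaration is kept.
    """
    namespaces = {}
    events = ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",))
    for _event, (prefix, uri) in events:
        namespaces.setdefault(prefix, uri)
    return namespaces
```

Running it on the same file in a single-directory run and in a full multi-directory run would show whether the on-disk declarations differ between the two cases, or whether the namespace only goes missing inside the parsed `souped_file` object.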
Does anyone know what could be going wrong? Thanks in advance!