Python Beautiful Soup parser sometimes skips namespaces

Question

I'm using Beautiful Soup to parse corporate financial filings. The 10-K filings are in XML format and come from the Securities and Exchange Commission's EDGAR site. I have downloaded about 50,000 filings and organized them into 14 separate directories by filing year. My script accesses one directory at a time, reads each XML filing in the directory with Beautiful Soup, and extracts some tags (these are XBRL tags that relate to financial statement items).

My problem: When Beautiful Soup reads in a filing, it sometimes skips over one of the namespaces. Then it stores only a very abbreviated version of the file in memory, and does not locate the desired tags.

What stumps me is that this problem only occurs when the script reads through multiple directories. When I adjust the script to read just one directory (corresponding to one year), Beautiful Soup reads in the complete file. But when the script iterates over every directory, it begins to skip namespaces after reading about 8,000 filings.

Relevant section of code:

    import os
    import re

    from bs4 import BeautifulSoup as bs

    ## Read through each directory, from the first through last year. NOTE: Code works fine if I use 'range(2013, 2014)' or similar
    for archive_year in range(min(list_of_years), max(list_of_years)+1):
    
      ## Change path to new directory
      os.chdir("/home/tladika/Data/EDGAR/10K/XBRL/%d" %archive_year)  
      
      ## Read through each XBRL filing in the directory one at a time 
      for file in os.listdir():
      
        ## Restrict to files that are XML format
        if (re.match(r'.+[0-9]{4}\.xml$', file)):
        
          ## Extract label of each XML filing 
          xml_filing_label = re.match(r'(.+)\.xml$', file).group(1)
           
          ## Open each individual filing        
          with open(file, "r") as xml_filing:
                    
            ## Read filing into Beautiful Soup object, with XML parsing
            souped_file = bs(xml_filing.read(), "xml")
  
            ## Find all tags related to stock compensation expenses, and store them in a list. 
            tags_share_expense_gross = souped_file.find_all("AllocatedShareBasedCompensationExpense")

    ## ---> At this point, the list tags_share_expense_gross is occasionally empty, even though I verify by hand that the tag exists in the XML filing
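
A minimal sketch of how the hand check could be automated, assuming it is enough to compare Beautiful Soup's result against a rough count taken from the raw file. The helper name `count_opening_tags` is hypothetical, and the regular expression is only an approximation: it does not understand comments or CDATA sections, so it is a diagnostic and not a replacement for the parser.

```python
import re


def count_opening_tags(raw_xml: bytes, local_name: str) -> int:
    """Roughly count opening tags called local_name in raw XML bytes.

    Matches both <Name ...> and <prefix:Name ...>, and ignores closing
    tags and longer names that merely start with local_name.
    """
    pattern = re.compile(
        rb"<(?:[A-Za-z_][\w.\-]*:)?"               # optional namespace prefix
        + re.escape(local_name.encode("ascii"))    # the tag's local name
        + rb"(?=[\s>/])"                           # name must end here
    )
    return len(pattern.findall(raw_xml))
```

Inside the loop, the filing could be opened with `open(file, "rb")`, and any file where `count_opening_tags(raw, "AllocatedShareBasedCompensationExpense")` is positive while `tags_share_expense_gross` is empty could be logged by name. That would show whether the failures start at a particular filing or a particular directory.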

Output and Evidence of Problem:

I hand-checked several filings for which the mistake occurs. Here is a screenshot of the head of one XML filing, when opened by a text editor (image: "Original Filing").

The full XML filing itself is here: https://www.sec.gov/Archives/edgar/data/899460/000119312513112693/mnkd-20121231.xml

When I read in just one year, the same information is stored in the souped_file object. But here is a screenshot of the entire XML filing (a printout of souped_file) when I iterate over directories (image: "Wrong Filing").

Note the filing is much shorter. The reason seems to be that one namespace is missing (circled in brown in the first image).
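
To check the missing-namespace observation without relying on screenshots, one option is to list the namespace declarations with a different parser and compare them with what appears in the printout of souped_file. The sketch below uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup; the function name `declared_namespaces` and the sample document with its example.com URIs are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def declared_namespaces(xml_source):
    """Return {prefix: uri} for the namespace declarations in an XML document.

    xml_source may be a file name or a binary file object. The default
    namespace, if declared, is stored under the empty-string prefix.
    """
    namespaces = {}
    # "start-ns" events are reported as (prefix, uri) pairs.
    for _event, (prefix, uri) in ET.iterparse(xml_source, events=("start-ns",)):
        namespaces.setdefault(prefix, uri)
    return namespaces


# Illustrative document only; the URIs are placeholders.
SAMPLE = (
    b'<xbrl xmlns="http://example.com/instance" '
    b'xmlns:us-gaap="http://example.com/us-gaap">'
    b'<us-gaap:Foo>1</us-gaap:Foo></xbrl>'
)

if __name__ == "__main__":
    print(declared_namespaces(io.BytesIO(SAMPLE)))
```

Running this on a filing that parses correctly in isolation and again on the same filing in the multi-directory run would show whether the declarations are intact in the file on disk, which would point at the parsing step and not at the file contents.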

Does anyone know what could be going wrong? Thanks in advance!
