Parsing content with BeautifulSoup

Question

I'm trying to write a script that takes a CIK, a report type and, optionally, an as-of date, and returns parsed financial report information from the SEC EDGAR public index.

The script mostly works and returns a dataframe of all values, each with a description and a reference year parsed from the contextref attribute. Unfortunately, there's quite a bit of variability in the form contextref takes for different tags, so I wonder if there's a cleaner way to extract this information than the logic and regular expressions I'm currently using. I checked the XBRL documentation, and it makes reference to a 'period' element, but I don't see it when I check tag.attrs. Is there a more straightforward way of extracting this info than some wonky regex?

Happy to provide examples of the more problematic contextref values if it's instructive.

Relevant portion of the code below:

import pandas as pd
from bs4 import BeautifulSoup
from requests import get

link = 'https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-20180630.xml'
r = get(link)
xml_text = r.text  # renamed from str to avoid shadowing the built-in

# the 'lxml' parser lowercases tag and attribute names, hence 'contextref'
soup = BeautifulSoup(xml_text, 'lxml')
tags = soup.find_all()
rows = []

for tag in tags:
    if ('us-gaap:' in tag.name   # only want gaap-related tags
            and tag.text.isdigit()): # only want values, no commentary (also skips negatives and decimals)
        #a = re.match("^C_"+ re.escape(cik) + "_[0-9]", tag['contextref'])
        name = tag.name.split('gaap:')[1]
        cref = tag['contextref'][-8:-4]
        value = tag.text
        rows.append({'field': name, 'period': cref, 'value': value})

# build the frame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows, columns=['field', 'period', 'value'])
print(df)
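
On the 'period' element mentioned above: in XBRL 2.1, contextRef is only an identifier that points at a separate context element, and the period is a child of that context rather than an attribute of the fact, which would explain why it never shows up in tag.attrs. Below is a minimal sketch of that lookup, not a definitive implementation. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra dependencies. The SAMPLE document, its context ids and values, and the helper names context_periods and gaap_facts are all made up for illustration; a real filing may organise its contexts differently.

```python
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"  # XBRL 2.1 instance namespace

# Hypothetical, heavily trimmed instance document (ids and values are invented).
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
                        xmlns:us-gaap="http://fasb.org/us-gaap/2018-01-31">
  <xbrli:context id="ctx_duration">
    <xbrli:period>
      <xbrli:startDate>2017-07-01</xbrli:startDate>
      <xbrli:endDate>2018-06-30</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:context id="ctx_instant">
    <xbrli:period>
      <xbrli:instant>2017-06-30</xbrli:instant>
    </xbrli:period>
  </xbrli:context>
  <us-gaap:Revenues contextRef="ctx_duration">1000</us-gaap:Revenues>
  <us-gaap:Assets contextRef="ctx_instant">2000</us-gaap:Assets>
</xbrli:xbrl>"""


def context_periods(root):
    """Map each context id to its instant date, or to its end date for durations."""
    periods = {}
    for ctx in root.iter(f"{{{XBRLI}}}context"):
        period = ctx.find(f"{{{XBRLI}}}period")
        if period is None:
            continue
        # A period holds either <instant> or a <startDate>/<endDate> pair.
        date = (period.findtext(f"{{{XBRLI}}}instant")
                or period.findtext(f"{{{XBRLI}}}endDate"))
        periods[ctx.get("id")] = date.strip() if date else None
    return periods


def gaap_facts(root):
    """Collect us-gaap facts and resolve each contextRef to a period date."""
    periods = context_periods(root)
    rows = []
    for elem in root:
        # ElementTree tags look like '{namespace-uri}LocalName'.
        if not isinstance(elem.tag, str) or not elem.tag.startswith("{"):
            continue
        namespace, _, local_name = elem.tag[1:].partition("}")
        ref = elem.get("contextRef")
        if "us-gaap" in namespace and ref is not None:
            rows.append({"field": local_name,
                         "period": periods.get(ref),
                         "value": (elem.text or "").strip()})
    return rows


rows = gaap_facts(ET.fromstring(SAMPLE))
for row in rows:
    print(row["field"], row["period"][:4], row["value"])
```

For a downloaded filing, the same functions could be tried on ET.fromstring(r.content) in place of the sample string. The reference year is then just the first four characters of the resolved date, so no slicing of the contextref string is needed.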
