Reputation: 1702
I'm trying to write a script that takes a CIK, a report type and, optionally, an as-of date, and returns parsed financial report information from the SEC EDGAR public index.
The script mostly works and returns a dataframe of all values with a description and a reference year parsed from the contextref attribute. Unfortunately, there's quite a bit of variability in the form contextref takes for different tags, so I wonder if there's a cleaner way to extract this information than the logic and regular expressions I'm currently using. I checked the XBRL documentation, and it makes reference to a 'period' element, but I don't see it when I check tag.attrs. Is there a more straightforward way of extracting this info, aside from some wonky regex?
Happy to provide examples of the more problematic contextref values if it's instructive.
Relevant portion of the code below:
import pandas as pd
from bs4 import BeautifulSoup
from requests import get

link = 'https://www.sec.gov/Archives/edgar/data/789019/000156459018019062/msft-20180630.xml'
r = get(link)
xml_text = r.text  # renamed from `str` so the built-in is not shadowed
soup = BeautifulSoup(xml_text, 'lxml')
tags = soup.find_all()
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for tag in tags:
    if ('us-gaap:' in tag.name        # only want gaap-related tags
            and tag.text.isdigit()):  # only want values, no commentary
        #a = re.match("^C_"+ re.escape(cik) + "_[0-9]", tag['contextref'])
        name = tag.name.split('gaap:')[1]
        cref = tag['contextref'][-8:-4]  # assumes the year sits at a fixed offset in the id
        value = tag.text
        rows.append({'field': name, 'period': cref, 'value': value})
df = pd.DataFrame(rows, columns=['field', 'period', 'value'])
print(df)
Upvotes: 0
Views: 1733
Reputation: 993
I'm afraid that your approach is fundamentally flawed. The value of the contextRef attribute is an arbitrary identifier that references a context element elsewhere in the document. Whilst the example you're looking at may contain a year in it, these identifiers could be anything (e.g. c1, c2, c3, etc.). In order to obtain the year, you need to dereference the context identified by the contextRef attribute, and look at the elements within the <period> element, e.g.
<xbrli:context id="c1">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.example.com/1234">1234</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:startDate>2018-01-01</xbrli:startDate>
    <xbrli:endDate>2018-12-31</xbrli:endDate>
  </xbrli:period>
</xbrli:context>
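A minimal sketch of that dereferencing step, using the standard library's namespace-aware ElementTree rather than a full XBRL processor. The sample document, the us-gaap namespace URI and the helper name build_context_periods are illustrative assumptions, not taken from the filing in the question, and the sample is stripped down rather than a valid filing:

```python
import xml.etree.ElementTree as ET

# Namespace of the XBRL instance schema (what the xbrli: prefix is bound to).
XBRLI = "http://www.xbrl.org/2003/instance"

# Made-up, stripped-down instance for illustration only.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2018-01-31">
  <xbrli:context id="c1">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.example.com/1234">1234</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2018-01-01</xbrli:startDate>
      <xbrli:endDate>2018-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <us-gaap:Revenues contextRef="c1">1000</us-gaap:Revenues>
</xbrli:xbrl>"""


def build_context_periods(root):
    """Map each context id to a dict of the dates inside its <period> element."""
    periods = {}
    for ctx in root.iter(f"{{{XBRLI}}}context"):
        period = ctx.find(f"{{{XBRLI}}}period")
        if period is None:
            continue
        # Children are startDate/endDate for durations, or instant for point-in-time.
        periods[ctx.get("id")] = {
            child.tag.split("}")[1]: (child.text or "").strip()
            for child in period
        }
    return periods


root = ET.fromstring(SAMPLE)
periods = build_context_periods(root)

# Look up the period of every element that carries a contextRef attribute.
for elem in root.iter():
    ctx_id = elem.get("contextRef")
    if ctx_id is not None:
        print(elem.tag, elem.text, periods[ctx_id])
```

Building the id-to-period map once and looking each fact up in it avoids any dependence on what the identifier itself looks like.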
Further, the us-gaap: part of the element name is a namespace prefix. XML documents may legitimately use other prefixes to refer to the same namespace. What matters is the namespace that the prefix is bound to, via the xmlns:us-gaap="..." declaration, usually on the root element. You should use a namespace-aware XML parser. I don't think that BeautifulSoup is properly namespace-aware.
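To illustrate matching on the namespace rather than the prefix, here is a rough sketch with ElementTree, which exposes each element name as {namespace-uri}localname. The namespace URI, the sample document (which deliberately binds the namespace to the prefix g) and the helper name facts_in_namespace are assumptions made for the example; in a real filing the URI should be read from the document's own xmlns declarations:

```python
import xml.etree.ElementTree as ET

# Illustrative namespace URI; the taxonomy version varies from filing to filing.
US_GAAP = "http://fasb.org/us-gaap/2018-01-31"

# Made-up sample: the same namespace bound to an unusual prefix, plus an
# unrelated element that happens to share the local name.
SAMPLE = """<root xmlns:g="http://fasb.org/us-gaap/2018-01-31"
      xmlns:other="http://www.example.com/other">
  <g:Revenues contextRef="c1">1000</g:Revenues>
  <other:Revenues contextRef="c1">5</other:Revenues>
</root>"""


def facts_in_namespace(root, namespace):
    """Return (local name, contextRef, text) for elements in the given namespace."""
    qualified = "{" + namespace + "}"
    return [
        (el.tag[len(qualified):], el.get("contextRef"), (el.text or "").strip())
        for el in root.iter()
        if el.tag.startswith(qualified)
    ]


root = ET.fromstring(SAMPLE)
facts = facts_in_namespace(root, US_GAAP)
print(facts)
```

A test such as 'us-gaap:' in tag.name would miss the g:Revenues element here, even though it belongs to the same namespace.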
I believe that the SEC system does restrict filers to using "recommended" namespace prefixes, so you might get away with this approach on SEC documents, but I would strongly recommend using an XBRL processor which will take care of namespaces, dereferencing contexts and many other issues associated with consuming XBRL. Arelle is an open source XBRL processor, but there are many others available.
Using an XBRL processor will also give you access to information such as human-readable labels from the taxonomy.
Upvotes: 2