Reputation: 393
I'm trying to modify some basic code for downloading and parsing SEC filings, but there's something being done in the parsing of the headers that I find completely baffling. I don't understand what's going on in the dictionary creation and header assignment of the following code:
def download_filing(filing):
    data=None
    try:
        data=open(filing).read()
    except:
        print 'Failed to get data...'
    if data==None: return None
    headers={}
    docs=[]
    docdata={}
    intext=False
    inheaders=False
    headerstack=['','','','','']
    for line in data.split('\n'):
        if line.strip()=='<DOCUMENT>':
            # Beginning of a new document
            docdata={'type':None,'sequence':-1,'filename':None,'description':None,'text':''}
        elif line.strip()=='</DOCUMENT>':
            # End of a document
            docs.append(docdata)
        elif line.strip()=='<TEXT>':
            # Text block
            intext=True
        elif line.strip()=='</TEXT>':
            # End of the text block
            intext=False
        elif line.strip().startswith('<SEC-HEADER>'):
            inheaders=True
        elif line.strip().startswith('</SEC-HEADER>'):
            inheaders=False
        elif inheaders and line.strip()!='':
            # Number of tabs before desc
            level=line.find(line.strip())
            sline=line.strip().replace(':','',1)
            # Find the dictionary level
            curdict=headers
            for i in range(level):
                curdict=curdict[headerstack[i]]
            # Determine if this is a field or another level of fields
            if sline.find('\t')!=-1:
                curdict[sline.split('\t')[0]]=sline.split('\t')[-1]
            else:
                headerstack[level]=sline
                curdict.setdefault(sline,{})
        elif intext:
            docdata['text']+=line+'\n'
        else:
            # See if this is document metadata
            for header in DOC_HEADERS:
                if line.startswith(header):
                    field=DOC_HEADERS[header]
                    docdata[field]=line[len(header):]
    return headers,docs
The goal is to parse an SEC filing like this one: http://www.sec.gov/Archives/edgar/data/356213/0000898430-95-000806.txt
and return a tuple containing a dictionary of dictionaries as "headers" and a list of dictionaries as "docs". Most of it appears pretty straightforward to me: open the filing, read it line by line, and use some control flow to tell the function whether it's in the header part of the document or the text part. I also understand the list creation at the end, which appends each "docdata" dictionary to "docs".
However, the headers part is blowing my mind. I more or less understand that the header parser builds nested dictionaries based on the number of tabs before each item, and then determines where to put each key. What I don't understand is how any of this ends up in the "headers" variable. It appears to be assigning headers to curdict, which seems completely backwards to me. The program defines headers as an empty dict at the top, then for each line assigns this empty dictionary to curdict and carries on. It then returns headers, which appears never to have been modified.
I'm guessing that this is my complete lack of understanding of how object assignment works in Python. I'm sure it's really obvious, but I'm not advanced enough to have seen programs written this way.
Upvotes: 0
Views: 107
Reputation: 780798
`headers` is a nested tree of dictionaries. The loop that assigns to `curdict` goes down to the Nth level in this tree, using `headerstack[i]` as the key for each level. It starts by initializing `curdict` to the top-level `headers`, then on each iteration it resets it to the child dictionary based on the next item in `headerstack`.
In Python, as in most OO languages, object assignment is by reference, not by copying. So once the final assignment to `curdict` is done, it contains a reference to one of the nested dictionaries. Then when it does:
curdict[sline.split('\t')[0]]=sline.split('\t')[-1]
it fills in that dictionary element, which is still part of the full tree that `headers` refers to.
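A minimal sketch of that reference behavior, in case it helps. The keys `COMPANY DATA` and `NAME` and the value `EXAMPLE CORP` are made up for illustration, not taken from your filing, and the single-argument `print(...)` calls are written so they run on both Python 2 and Python 3:

```python
headers = {'COMPANY DATA': {}}

# Plain assignment binds a second name to the same dict object; nothing is copied.
curdict = headers['COMPANY DATA']
print(curdict is headers['COMPANY DATA'])   # True

# Mutating the dict through curdict is therefore visible through headers as well.
curdict['NAME'] = 'EXAMPLE CORP'
print(headers)

# Rebinding the name curdict, by contrast, does not touch headers at all.
curdict = {}
print(headers)
```

The distinction that matters is between mutating the object a name refers to (`curdict[key] = value`) and rebinding the name itself (`curdict = ...`). Your code does the first when it stores a field, which is why `headers` ends up populated.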
For example, if `headerstack` contains `['a', 'b', 'c', 'd']` and `level = 3`, then the loop will set `curdict` to a reference to `headers['a']['b']['c']`. If `sline` is `foo\tbar`, the above assignment will then be equivalent to:
headers['a']['b']['c']['foo'] = 'bar'
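You can check that example directly. This is only a sketch: it assumes the nested levels `a`, `b` and `c` already exist in `headers` (in your function they would have been created by earlier header lines), and the names are the placeholder ones from the example above:

```python
# Assumed starting state: the three nested levels already exist.
headers = {'a': {'b': {'c': {}}}}
headerstack = ['a', 'b', 'c', 'd']
level = 3
sline = 'foo\tbar'

# Same walk as in the question's code: follow one key per level.
curdict = headers
for i in range(level):
    curdict = curdict[headerstack[i]]

# Same field assignment as in the question's code.
curdict[sline.split('\t')[0]] = sline.split('\t')[-1]

print(headers)   # {'a': {'b': {'c': {'foo': 'bar'}}}}
```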
I'll show how this happens, step by step. Since `level` is 3, `range(level)` yields `i` = 0, 1, 2. At the start of the loop, we have:
curdict == headers
During the first iteration of the loop:
i = 0
curdict = curdict[headerstack[i]]
is equivalent to:
curdict = headers['a']
On the next iteration:
i = 1
curdict = curdict[headerstack[i]]
is equivalent to:
curdict = curdict['b']
which is equivalent to:
curdict = headers['a']['b']
On the next (final) loop iteration:
i = 2
curdict = curdict[headerstack[i]]
is equivalent to:
curdict = curdict['c']
which is:
curdict = headers['a']['b']['c']
So at this point, `curdict` refers to the same dictionary that `headers['a']['b']['c']` does. Anything you do to the dictionary in `curdict` also happens to the dictionary in `headers`. So when you do:
curdict['foo'] = 'bar'
it's equivalent to doing:
headers['a']['b']['c']['foo'] = 'bar'
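To see the whole thing working, including how the nested levels get created in the first place, here is the header branch of your function pulled out into a standalone sketch. The function name `parse_headers` and the sample lines are my own invention for illustration (real SEC headers have more fields and levels); the logic inside the loop is copied from your code:

```python
def parse_headers(lines):
    """Build a nested dict from tab-indented 'KEY:<tab>VALUE' lines."""
    headers = {}
    headerstack = ['', '', '', '', '']
    for line in lines:
        if not line.strip():
            continue
        level = line.find(line.strip())        # number of leading tabs
        sline = line.strip().replace(':', '', 1)
        curdict = headers
        for i in range(level):                 # walk down to the current level
            curdict = curdict[headerstack[i]]
        if '\t' in sline:                      # "KEY<tab>VALUE" is a field
            curdict[sline.split('\t')[0]] = sline.split('\t')[-1]
        else:                                  # bare "KEY" opens a new nested level
            headerstack[level] = sline
            curdict.setdefault(sline, {})
    return headers

# Hypothetical sample input, loosely shaped like an SEC header block.
sample = [
    'FILER:',
    '\tCOMPANY DATA:',
    '\t\tCOMPANY NAME:\tEXAMPLE CORP',
    '\tFILING VALUES:',
    '\t\tFORM TYPE:\t10-K',
]
result = parse_headers(sample)
print(result)
```

Note that `headers` is never assigned to again after `headers = {}`. Every change goes through `curdict`, either via `setdefault` (which adds a new empty child dictionary) or via the field assignment, and both mutate dictionaries that are reachable from `headers`.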
Upvotes: 1