showkey

Reputation: 298

How can I parse a whole file as SGML with a Python library?

I want to parse a 13-F form from the SEC website to get all infoTable elements.

Get the target data:

import gzip
from urllib.request import Request, urlopen

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
req = Request(
    url=url,
    headers={'User-Agent': '[email protected]',
             "Accept-Encoding": "gzip, deflate",
             'Host': 'www.sec.gov'}
    )
webpage = urlopen(req).read()
# The request asks for a compressed body; this assumes the server answers with gzip
content = gzip.decompress(webpage)
data = content.decode('utf-8')
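
The download step above always calls gzip.decompress, although the request also accepts deflate. A minimal sketch of a more defensive decode step, assuming the caller passes in the Content-Encoding response header (decode_body is a hypothetical helper name, not part of urllib):

```python
import gzip
import zlib


def decode_body(raw, content_encoding):
    """Decompress raw bytes according to the Content-Encoding header, then decode as UTF-8."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        # Assumes a zlib-wrapped deflate stream; some servers send raw deflate instead
        raw = zlib.decompress(raw)
    # Any other value (or no header) is treated as an uncompressed body
    return raw.decode("utf-8")
```

With urlopen, the header should be available from the response object through its headers.get("Content-Encoding") call, so the response has to be kept in a variable before calling read().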

Parse it with several libraries.

With minidom

from xml.dom import minidom
xmldoc = minidom.parseString(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/dom/minidom.py", line 2000, in parseString
    return expatbuilder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 17, column 52

With xml.etree

import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/etree/ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 52

With lxml.etree

from lxml import etree
tree = etree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1105, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "<string>", line 17
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 17, column 53

None of them can load the data. minidom and xml.etree both report "not well-formed (invalid token): line 17, column 52", and lxml reports "xmlParseEntityRef: no name" at line 17, column 53.
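
The lxml message "xmlParseEntityRef: no name" is what libxml2 typically reports for a bare & that does not start an entity reference, so the cause may be an unescaped ampersand in the SGML header rather than a strange tag. A minimal sketch for printing the line a parser complains about, run here on a made-up header line rather than the real filing (show_parse_error and sample are hypothetical names):

```python
import xml.etree.ElementTree as ET


def show_parse_error(text):
    """Return (line_number, column, offending_line) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple; the line number is 1-based
        line_no, column = err.position
        return line_no, column, text.splitlines()[line_no - 1]
    return None


# Made-up sample: the second line contains a bare ampersand, which XML does not allow
sample = "<SEC-HEADER>\nCLASSIFICATION: FIRE, MARINE & CASUALTY\n</SEC-HEADER>"

print(show_parse_error(sample))
```

Calling show_parse_error(data) on the downloaded text would show what is actually at line 17 of the filing.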


I do not see any strange tag at line 17, column 52 of the 13-F form. How can I fix the not well-formed error?

Latest update:
I can pick out part of the file with a regular expression, then parse that part with minidom or load it directly with pandas:

Pick out the informationTable section with a regex.

import re
pattern = re.compile(r"<informationTable.*?>.*?<\/informationTable>", flags=re.DOTALL)
data_str = re.findall(pattern,data)[0]

Get the result with minidom

from xml.dom import minidom
dom = minidom.parseString(data_str)
dlist = dom.getElementsByTagName('infoTable')
result = []
for item in dlist:
    emp_result = {}
    for child in item.childNodes:
        # Keep only element children and skip whitespace text nodes
        if child.nodeType == minidom.Node.ELEMENT_NODE:
            # An empty element has no firstChild, so guard against None
            emp_result[child.tagName] = child.firstChild.data if child.firstChild else None
    result.append(emp_result)

Get the result with pandas:

import pandas as pd
from io import StringIO  
df = pd.read_xml(StringIO(data_str))

Is there a more robust way to parse the whole file as SGML with a Python library?

Upvotes: 1

Views: 80

Answers (1)

JonSG

Reputation: 13152

If you only want the "XML" parts, you might be able to pick them out of the SGML file with a regex and then hand each one to one of your more formal XML parsers.

import requests
import re

# Each embedded XML document sits between <XML> and </XML> in the SGML file
pattern = re.compile(r"<XML>(.*?)</XML>", flags=re.DOTALL)
url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
headers = {'User-Agent': '[email protected]', "Accept-Encoding": "gzip, deflate", 'Host': 'www.sec.gov'}
response = requests.get(url, headers=headers)

for index, match in enumerate(pattern.finditer(response.text)):
    # Print just the first 10 characters of each XML document
    print(index, match.group(1)[:10].strip() + "...")

That gives me:

0 <?xml ver...
1 <informat...
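
From there, each extracted block can be parsed on its own. The following is a minimal sketch, assuming every <XML> block is well-formed once the surrounding SGML is stripped away. The names info_tables and local_name are hypothetical helpers, and the sample text, including its namespace URI and values, is made up for illustration rather than taken from the filing:

```python
import re
import xml.etree.ElementTree as ET

XML_BLOCK = re.compile(r"<XML>(.*?)</XML>", flags=re.DOTALL)


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def info_tables(sgml_text):
    """Collect one dict per infoTable element found in any <XML> block."""
    rows = []
    for match in XML_BLOCK.finditer(sgml_text):
        # Strip whitespace so an XML declaration, if present, starts the string
        root = ET.fromstring(match.group(1).strip())
        for elem in root.iter():
            if local_name(elem.tag) != "infoTable":
                continue
            row = {}
            for leaf in elem.iter():
                # Keep only leaf descendants, which flattens nested groups
                if leaf is not elem and len(leaf) == 0:
                    row[local_name(leaf.tag)] = (leaf.text or "").strip()
            rows.append(row)
    return rows


# Made-up SGML-like text with a bare ampersand outside the XML blocks
sample = """<SEC-HEADER>
CLASSIFICATION: FIRE, MARINE & CASUALTY
</SEC-HEADER>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<edgarSubmission><formData/></edgarSubmission>
</XML>
<XML>
<informationTable xmlns="http://example.com/hypothetical-ns">
  <infoTable>
    <nameOfIssuer>EXAMPLE CORP</nameOfIssuer>
    <value>100</value>
    <shrsOrPrnAmt>
      <sshPrnamt>5</sshPrnamt>
      <sshPrnamtType>SH</sshPrnamtType>
    </shrsOrPrnAmt>
  </infoTable>
</informationTable>
</XML>
"""

rows = info_tables(sample)
print(rows)
```

Matching on the local tag name avoids hard-coding a namespace, at the cost of ignoring which namespace an element belongs to. The parts of the file outside the <XML> blocks are never passed to the XML parser, so characters that are not valid XML there no longer matter.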

Upvotes: 0
