How can I parse a whole file as SGML with a Python library?

Question

I want to parse a 13-F form from the SEC website to get all infoTable elements.

Get the target data:

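import gzip
from urllib.request import Request, urlopen

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
req = Request(
    url=url,
    headers={
        "User-Agent": "xxxx@gmail.com",
        # Only ask for gzip, since that is the only encoding decoded below.
        "Accept-Encoding": "gzip",
        "Host": "www.sec.gov",
    },
)
with urlopen(req) as resp:
    webpage = resp.read()
    encoding = resp.headers.get("Content-Encoding", "")
# Decompress only if the server actually sent a gzip-compressed body.
content = gzip.decompress(webpage) if encoding == "gzip" else webpage
data = content.decode("utf-8")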

Then I tried to parse it with several libraries.

With minidom

from xml.dom import minidom
xmldoc = minidom.parseString(data)
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib/python3.11/xml/dom/minidom.py", line 2000, in parseString
    return expatbuilder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 17, column 52

With xml.etree

import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/python3.11/xml/etree/ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 52

With lxml.etree

from lxml import etree
tree = etree.fromstring(data)
Traceback (most recent call last):
  File "", line 1, in 
  File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1105, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "", line 17
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 17, column 53

None of them can load the data. minidom and xml.etree report "not well-formed (invalid token): line 17, column 52", and lxml reports "xmlParseEntityRef: no name, line 17, column 53".
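
The lxml message `xmlParseEntityRef: no name` is what libxml2 typically reports for an `&` that does not start an entity reference, so the culprit at line 17 is probably a bare ampersand in the plain-text header rather than an unusual tag. A minimal sketch for checking this with the standard library only; the helper name `locate_parse_error` and the sample text are made up for illustration:

```python
import xml.etree.ElementTree as ET


def locate_parse_error(text):
    """Return (line_number, column, offending_line) for the first XML parse
    error in text, or None if the text parses cleanly."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        line_no, column = exc.position  # line is 1-based
        return line_no, column, text.splitlines()[line_no - 1]
    return None


# Hypothetical stand-in for the downloaded file: a bare "&" on line 2.
sample = "<root>\n<name>FIRE, MARINE & CASUALTY</name>\n</root>"
error = locate_parse_error(sample)
print(error)
```

Calling `locate_parse_error(data)` on the downloaded text should return the line the parsers complain about, which makes it easy to see which character they reject.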

I don't see any strange tag at line 17, column 52 of the 13-F form. How can I fix the not-well-formed error?

Latest update:
As a workaround, I can extract just the informationTable section with a regular expression, then parse it with minidom or load it directly with pandas:

Extract the section with a regex:

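import re
# Match the embedded informationTable element, from its opening tag to its closing tag.
pattern = re.compile(r"<informationTable.*?</informationTable>", flags=re.DOTALL)
data_str = pattern.findall(data)[0]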

Get the result with minidom:

from xml.dom import minidom 
dom = minidom.parseString(data_str)
dlist = dom.getElementsByTagName('infoTable')    
result = []
for item in dlist:
    emp_result = {}
    for child in item.childNodes:
        if child.nodeType == minidom.Node.ELEMENT_NODE:
            emp_result[child.tagName] = child.firstChild.data
    result.append(emp_result)

Get the result with pandas:

import pandas as pd
from io import StringIO  
df = pd.read_xml(StringIO(data_str))

Is there a more robust way to parse the whole file as SGML with a Python library?
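
One possible direction, short of a real SGML parser, is to treat the file as a plain-text wrapper around embedded XML documents and hand only those payloads to an XML parser. The sketch below assumes the EDGAR full-submission layout, in which each `<DOCUMENT>` block carries a `<TYPE>` line and an `<XML>...</XML>` payload; the function names and the miniature sample are hypothetical, and the namespace URI in the sample is a placeholder, not the real one.

```python
import re
import xml.etree.ElementTree as ET

DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TYPE_RE = re.compile(r"<TYPE>([^\r\n<]+)")
XML_RE = re.compile(r"<XML>(.*?)</XML>", re.DOTALL)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def embedded_xml(text):
    """Yield (document_type, root_element) for each <DOCUMENT> with an <XML> payload."""
    for block in DOC_RE.findall(text):
        doc_type = TYPE_RE.search(block)
        payload = XML_RE.search(block)
        if doc_type and payload:
            # strip() so that an XML declaration, if present, starts the string
            yield doc_type.group(1).strip(), ET.fromstring(payload.group(1).strip())


def info_tables(text):
    """Collect every infoTable element in the submission as a flat dict."""
    rows = []
    for _doc_type, root in embedded_xml(text):
        for elem in root.iter():
            if local_name(elem.tag) == "infoTable":
                # Children that only wrap other elements end up as empty strings.
                rows.append({local_name(c.tag): (c.text or "").strip() for c in elem})
    return rows


# Made-up miniature submission; the bare "&" in the header is harmless here
# because the header never reaches the XML parser.
sample = """<SEC-DOCUMENT>demo.txt
<SEC-HEADER>
CLASSIFICATION: FIRE, MARINE & CASUALTY INSURANCE
</SEC-HEADER>
<DOCUMENT>
<TYPE>INFORMATION TABLE
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<informationTable xmlns="http://example.com/ns">
  <infoTable><nameOfIssuer>ACME CORP</nameOfIssuer><value>100</value></infoTable>
  <infoTable><nameOfIssuer>EXAMPLE INC</nameOfIssuer><value>250</value></infoTable>
</informationTable>
</XML>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
"""

rows = info_tables(sample)
print(rows)
```

With the real download, `info_tables(data)` would replace the single-match regex, and the resulting list of dicts can be passed to `pd.DataFrame`. Nested fields would still need their own handling, since only the direct text of each child is kept.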

