Reputation: 298
I want to parse a 13-F form on the SEC website to get all infoTable elements.
Get the target data:
import gzip
from urllib.request import Request, urlopen

url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
req = Request(
    url=url,
    headers={
        'User-Agent': '[email protected]',
        'Accept-Encoding': 'gzip, deflate',
        'Host': 'www.sec.gov',
    },
)
webpage = urlopen(req).read()
content = gzip.decompress(webpage)
data = content.decode('utf-8')
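The gzip.decompress call above assumes the server always returns a gzip-compressed body. A minimal sketch of a more defensive decode step, assuming it is enough to check for the gzip magic bytes (the helper name decode_body is hypothetical):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Return the response body as text, decompressing only if it is gzip data."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)
```

With this helper, `data = decode_body(webpage)` would replace the two lines that decompress and decode.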
Parse it with some library.
With minidom
from xml.dom import minidom
xmldoc = minidom.parseString(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/dom/minidom.py", line 2000, in parseString
    return expatbuilder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 17, column 52
With xml.etree
import xml.etree.ElementTree as ET
tree = ET.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.11/xml/etree/ElementTree.py", line 1338, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 17, column 52
With lxml.etree
from lxml import etree
tree = etree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/etree.pyx", line 3306, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1995, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1105, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 633, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 743, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 672, in lxml.etree._raiseParseError
  File "<string>", line 17
lxml.etree.XMLSyntaxError: xmlParseEntityRef: no name, line 17, column 53
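None of them can load the data; all of them report a parse error at line 17 (minidom and ElementTree say "not well-formed (invalid token): line 17, column 52", lxml says "xmlParseEntityRef: no name, line 17, column 53").
I do not see any strange tag at line 17, column 52 of the 13-F form. How can I fix the not well-formed issue?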
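To see what actually sits at the reported position, the offending line can be pulled out using the parse error itself. The sketch below runs on a small made-up string rather than the real filing, and the helper name locate_parse_error is hypothetical. lxml's "xmlParseEntityRef: no name" is the message libxml2 gives for an & that does not start an entity reference, so a bare ampersand in the plain-text part of the file is a likely suspect, though that stays an assumption until the line is inspected.

```python
import xml.etree.ElementTree as ET


def locate_parse_error(text):
    """Return (line_number, column, line_text) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line_no, column = err.position  # line is 1-based, as in the error message
        return line_no, column, text.splitlines()[line_no - 1]
    return None


# Made-up input: the ampersand on the second line is not escaped as &amp;
sample = "<doc>\n<name>FIRE, MARINE & CASUALTY</name>\n</doc>"
hit = locate_parse_error(sample)
if hit is not None:
    print(f"line {hit[0]}, column {hit[1]}: {hit[2]!r}")
```

Calling `locate_parse_error(data)` on the downloaded text should show which characters the parsers are rejecting on line 17.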
Latest update:
Pick out the relevant section with a regular expression, then parse it with minidom or load it directly with pandas:
Pick out the section with a regex:
import re
pattern = re.compile(r"<informationTable.*?>.*?<\/informationTable>", flags=re.DOTALL)
data_str = re.findall(pattern,data)[0]
Get the result with minidom
from xml.dom import minidom
dom = minidom.parseString(data_str)
dlist = dom.getElementsByTagName('infoTable')
result = []
for item in dlist:
    emp_result = {}
    for child in item.childNodes:
        if child.nodeType == minidom.Node.ELEMENT_NODE:
            emp_result[child.tagName] = child.firstChild.data
    result.append(emp_result)
Get the result with pandas:
import pandas as pd
from io import StringIO
df = pd.read_xml(StringIO(data_str))
Is there a robust way to parse the whole file as SGML with some Python library?
Upvotes: 1
Views: 80
Reputation: 13152
If you only want the "XML" parts, you might be able to treat the file as SGML, pick out each <XML>...</XML> block with a regex, and then parse those blocks with one of the more formal XML parsers.
import requests
import re
pattern = re.compile(r"<XML>(.*?)<\/XML>", flags=re.DOTALL)
url = "https://www.sec.gov/Archives/edgar/data/1067983/000095012324011775/0000950123-24-011775.txt"
headers={'User-Agent': '[email protected]', "Accept-Encoding":"gzip, deflate", 'Host': 'www.sec.gov'}
req = requests.get(url, headers=headers)
for index, match in enumerate(pattern.finditer(req.text)):
    ## just the first 10 characters of the XML document
    print(index, match.group(1)[:10].strip() + "...")
That gives me:
0 <?xml ver...
1 <informat...
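From there, each extracted block can be handed to a regular XML parser. A minimal sketch, assuming every <XML>...</XML> block is well-formed on its own and that only the leaf values under each infoTable are wanted. The helper names, the sample text, and its namespace URI are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

XML_BLOCK = re.compile(r"<XML>(.*?)</XML>", flags=re.DOTALL)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_info_tables(submission_text):
    """Collect one dict of leaf values per infoTable element in any XML block."""
    rows = []
    for match in XML_BLOCK.finditer(submission_text):
        # strip() matters: the XML declaration must be the very first thing parsed
        root = ET.fromstring(match.group(1).strip())
        for elem in root.iter():
            if local_name(elem.tag) != "infoTable":
                continue
            row = {}
            for leaf in elem.iter():
                # keep only elements that have no children but do have text
                if leaf is not elem and len(leaf) == 0 and leaf.text and leaf.text.strip():
                    row[local_name(leaf.tag)] = leaf.text.strip()
            rows.append(row)
    return rows


# Made-up stand-in for the downloaded submission text
sample = """<SEC-DOCUMENT>plain-text header with a bare & in it
<DOCUMENT>
<TEXT>
<XML>
<?xml version="1.0" encoding="UTF-8"?>
<informationTable xmlns="urn:example:information-table">
<infoTable>
<nameOfIssuer>ACME CORP</nameOfIssuer>
<value>100</value>
<shrsOrPrnAmt><sshPrnamt>5</sshPrnamt><sshPrnamtType>SH</sshPrnamtType></shrsOrPrnAmt>
</infoTable>
</informationTable>
</XML>
</TEXT>
</DOCUMENT>
"""
print(extract_info_tables(sample))
```

Nested elements are flattened into the same dict here, which is a design choice for the sketch; blocks without any infoTable (such as the first one in the output above) simply contribute no rows.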
Upvotes: 0