Reputation: 95
I'm trying to extract the XML portion of a .txt file in Python. The file comes from the EDGAR database and contains several representations of a 10-K report in a single file: first HTML, then XML, and then other formats such as PDF.
If anyone knows a way to extract this XML so I can use its tags, I'd greatly appreciate it.
Here's an example of the txt file I'm talking about: https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt
Upvotes: 2
Views: 1374
Reputation: 2469
How about this?
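def getData(xml):
    # Process one extracted XML document (passed in as a single string) here.
    pass

with open('0000051143-13-000007.txt', 'r') as file:  # path to the downloaded EDGAR .txt file
    lines = []
    flag = False  # True while we are inside an XML document
    for line in file:
        if '</XBRL>' in line:
            # End of the current block: hand over whatever was collected.
            if lines:
                getData("".join(lines))
            flag = False
            lines = []
        if flag or '<?xml ' in line:
            # Start collecting at the XML declaration and keep going until </XBRL>.
            flag = True
            lines.append(line)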
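As a rough sketch of what the getData placeholder could do once it receives one document as a string: the standard library's xml.etree.ElementTree can parse it and walk the tags. The count_tags helper and the miniature sample below are made up for illustration and are not taken from the filing; real XBRL instances are far larger and use real namespace URIs.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags(xml_text):
    """Parse one XML document and count how often each tag name occurs."""
    # lstrip() because the XML declaration has to be the very first thing in the document.
    root = ET.fromstring(xml_text.lstrip())
    counts = Counter()
    for element in root.iter():
        # ElementTree spells namespaced tags as "{namespace-uri}local"; keep only the local part.
        local_name = element.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
    return counts

# Hypothetical miniature document, only here to make the sketch runnable.
sample = """<?xml version="1.0" encoding="utf-8"?>
<xbrl xmlns:us-gaap="http://example.com/us-gaap">
  <us-gaap:Revenues>100</us-gaap:Revenues>
  <us-gaap:Revenues>120</us-gaap:Revenues>
  <us-gaap:Assets>500</us-gaap:Assets>
</xbrl>"""

tag_counts = count_tags(sample)
print(tag_counts.most_common())
```

In the loop above you could then call count_tags from inside getData, or replace the counting with whatever lookup you need on root.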
Upvotes: 0
Reputation: 98861
You can try using:
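import re
import requests

url = "https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt"
# SEC asks automated clients to identify themselves; replace the placeholder with your own contact details.
text = requests.get(url, headers={"User-Agent": "Your Name your.email@example.com"}).text
# Case-sensitive on purpose: with re.IGNORECASE, </XBRL> could also match a lowercase </xbrl> root tag and cut the document short.
pattern = r"<FILENAME>(\S+\.xml)\s+<DESCRIPTION>[^\n]*\s+<TEXT>\s+<XBRL>(.*?)</XBRL>"
for xml in re.finditer(pattern, text, re.DOTALL):
    xml_filename = xml.group(1)
    xml_content = xml.group(2).strip()  # the XML declaration must be the first thing in the file
    with open(xml_filename, "w", encoding="utf-8") as w:
        w.write(xml_content)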
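If you already have the .txt file on disk, the same idea can be applied to its contents without the download step. Below is a minimal sketch, assuming each XBRL document sits after a FILENAME line ending in .xml, a DESCRIPTION line, and a TEXT line, wrapped in an uppercase XBRL block. The extract_xbrl helper and the miniature sample are hypothetical and only mimic that layout.

```python
import re

# Assumed layout: <FILENAME>name.xml, <DESCRIPTION>..., <TEXT>, then <XBRL> ... </XBRL>.
# The pattern is case-sensitive so a lowercase </xbrl> root tag does not end the match early.
XBRL_DOC = re.compile(
    r"<FILENAME>(\S+\.xml)\s+<DESCRIPTION>[^\n]*\s+<TEXT>\s+<XBRL>(.*?)</XBRL>",
    re.DOTALL,
)

def extract_xbrl(text):
    """Return a dict mapping each .xml filename to its XML content."""
    return {m.group(1): m.group(2).strip() for m in XBRL_DOC.finditer(text)}

# Made-up sample that imitates the structure of a filing; not real data.
sample = """<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<FILENAME>report.htm
<DESCRIPTION>10-K
<TEXT>
<html><body>Annual report</body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-101.INS
<SEQUENCE>2
<FILENAME>demo-20121231.xml
<DESCRIPTION>XBRL INSTANCE DOCUMENT
<TEXT>
<XBRL>
<?xml version="1.0" encoding="US-ASCII"?>
<xbrl><Revenues>100</Revenues></xbrl>
</XBRL>
</TEXT>
</DOCUMENT>
"""

docs = extract_xbrl(sample)
for name, content in docs.items():
    print(name, len(content))
```

For the real file you would read it with open(...).read() and pass the result to extract_xbrl. The HTML document is skipped because its filename does not end in .xml.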
Upvotes: 1