segfault

Reputation: 95

Extracting XML from a txt file

I'm trying to extract the XML portion of a txt file in Python. The file I'm using comes from the EDGAR database and contains multiple representations of a 10-K report in a single txt file: HTML first, then XML, and then other representations such as PDF.

If anyone knows a way to extract this XML so I can use its tags, I'd greatly appreciate it.

Here's an example of the txt file I'm talking about: https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt

Upvotes: 2

Views: 1374

Answers (2)

dabingsou

Reputation: 2469

How about this?

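def getData(xml):
    # Process one extracted XML document here (for example, parse it).
    pass

with open('0000051143-13-000007.txt', 'r') as file:  # path to the downloaded EDGAR txt file
    lines = []
    in_xml = False
    for line in file:
        if '</XBRL>' in line:
            # End of an XBRL-wrapped block: hand over the collected lines, if any.
            if in_xml:
                getData("".join(lines))
            in_xml = False
            lines = []
        elif in_xml or '<?xml ' in line:
            # Start collecting at the XML declaration and continue until </XBRL>.
            in_xml = True
            lines.append(line)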
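Since the goal is to use the XML's tags, here is a minimal sketch of what the body of getData might look like, assuming the standard library's xml.etree.ElementTree is enough for your files. The names count_tags and local_name are hypothetical, and the sample document with its example.com namespaces is invented for illustration; it is not taken from the EDGAR file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]

def count_tags(xml_text):
    """Parse one XML document and count how often each element name occurs."""
    # strip() matters: the XML declaration must be the very first thing in the text.
    root = ET.fromstring(xml_text.strip())
    return Counter(local_name(el.tag) for el in root.iter())

# Invented sample, only loosely shaped like an XBRL instance document.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<xbrl xmlns="http://example.com/instance" xmlns:gaap="http://example.com/gaap">
  <gaap:Revenues contextRef="FY2012">100</gaap:Revenues>
  <gaap:Assets contextRef="FY2012">250</gaap:Assets>
</xbrl>"""

tag_counts = count_tags(sample)
```

Calling count_tags from inside getData would give you one counter per extracted block. Real filings are much larger, so a streaming parser may be a better fit there.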

Upvotes: 0

Pedro Lobito

Reputation: 98861

You can try using:

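import re
import requests

url = "https://www.sec.gov/Archives/edgar/data/51143/000005114313000007/0000051143-13-000007.txt"
# SEC.gov may reject requests without a descriptive User-Agent; the value below is a placeholder.
headers = {"User-Agent": "Your Name your.email@example.com"}
text = requests.get(url, headers=headers).text

# Capture each .xml file name and the body wrapped in <XBRL>...</XBRL>.
pattern = r"<FILENAME>(\S+\.xml)\s+<DESCRIPTION>[^\r\n]*\s+<TEXT>\s*<XBRL>(.*?)</XBRL>"
for xml in re.finditer(pattern, text, re.IGNORECASE | re.DOTALL):
    xml_filename = xml.group(1)
    xml_content = xml.group(2).strip()  # drop whitespace before the XML declaration
    with open(xml_filename, "w") as w:
        w.write(xml_content)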
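If you want to try the regex idea without hitting the network or writing files, here is a small offline sketch. It uses a variation of the pattern above (the dot in ".xml" is escaped and the description may contain spaces) and returns the matches as a dict. The function name extract_xml_documents and the sample text are made up; the sample only imitates the layout I'd expect in an EDGAR txt file, so treat that layout as an assumption.

```python
import re

# Capture the .xml file name and the body wrapped in <XBRL>...</XBRL>.
DOC_PATTERN = re.compile(
    r"<FILENAME>(\S+\.xml)\s+<DESCRIPTION>[^\r\n]*\s+<TEXT>\s*<XBRL>(.*?)</XBRL>",
    re.IGNORECASE | re.DOTALL,
)

def extract_xml_documents(text):
    """Return {file name: XML text} for every XBRL-wrapped .xml document found."""
    return {m.group(1): m.group(2).strip() for m in DOC_PATTERN.finditer(text)}

# Hand-made sample: one HTML document followed by one XML document.
sample = """<DOCUMENT>
<TYPE>10-K
<FILENAME>report.htm
<DESCRIPTION>10-K
<TEXT>
<html><body>Annual report</body></html>
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-101.INS
<FILENAME>demo-20121231.xml
<DESCRIPTION>XBRL INSTANCE DOCUMENT
<TEXT>
<XBRL>
<?xml version="1.0" encoding="UTF-8"?>
<root><item>1</item></root>
</XBRL>
</TEXT>
</DOCUMENT>
"""

documents = extract_xml_documents(sample)
```

Keeping the results in a dict makes it easy to hand each document straight to an XML parser instead of going through the file system.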


Upvotes: 1
