Reputation: 2296

Splitting large xml file into multiple files by using beautifulsoup

I am trying to split a large XML file into smaller ones. I first tried Beautiful Soup:

from bs4 import BeautifulSoup
import os
# Core settings
rootdir = r'C:\Users\XX\Documents\Grant Data\2010_xml'
extension = ".xml"
to_save = r'C:\Users\XX\Documents\all_patents_as_xml'

index = 0
for root, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(extension):
            print(file)
            file_name = os.path.join(root,file)
            with open(file_name) as f:
                data = f.read()
            texts = data.split('?xml version="1.0" encoding="UTF-8"?')
            for text in texts:
                index += 1
                filename = to_save + "\\"+ str(index) + ".txt"
                with open(filename, 'w') as f:
                    f.write(text)

However, I got a memory error, so I switched to xml.etree:

from xml.etree import ElementTree as ET
import re


file_name = r'C:\Users\XX\Documents\Grant Data\2010_xml\2010cat_xml.xml'


with open(file_name) as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
parser = ET.iterparse(tree)
to_save = r'C:\Users\Yilmaz\Documents\all_patents_as_xml'
index = 0
for event, element in parser:
    # element is a whole element
    if element.tag == '?xml version="1.0" encoding="UTF-8"?':
        index += 1
        filename = to_save + "\\"+ str(index) + ".txt"
        with open(filename, 'w') as f:
            f.write(ET.tostring(element))
        # do something with this element
        # then clean up
        element.clear()

and I get the following error:

OverflowError: size does not fit in an int

I am using Windows. I know that on Linux you can split XML files from the console, but in my case I don't know what to do.

Upvotes: 2

Answers (2)

Louis

Reputation: 151511

There are major issues with your question and your attempts at solving it:

1. You mention using Beautiful Soup. However, while you import Beautiful Soup in your code, you don't actually do anything with it.
2. The code you show that uses xml.etree is grossly incorrect. At the line parser = ET.iterparse(tree), tree is an XML tree already parsed with ET.fromstring, but the argument to iterparse must either be a file name or a file object. An XML tree is neither of those. So that attempt is dead on arrival.
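
To illustrate the point about iterparse, here is a minimal sketch of how it is normally fed. The in-memory stream and the docs/doc element names are assumptions made up for the illustration; a real file opened in binary mode, or simply its path, would be passed the same way.

```python
import io
from xml.etree import ElementTree as ET

# Stand-in for a file on disk (hypothetical content); a path string or
# a file object opened with open(path, "rb") is accepted in the same position.
source = io.BytesIO(b"<docs><doc id='1'/><doc id='2'/></docs>")

ids = []
for event, element in ET.iterparse(source, events=("end",)):
    if element.tag == "doc":
        ids.append(element.get("id"))
        element.clear()  # drop the element's contents once it has been handled

print(ids)
```

The element handed to the loop on an "end" event is complete, which is why it is safe to process and clear it there.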

But more importantly, it looks like what you are trying to process is a file which contains a bunch of concatenated XML files. In your xml.etree attempt you have this test:

element.tag == '?xml version="1.0" encoding="UTF-8"?'

The only intent I can imagine for this test is that you think that xml.etree will somehow interpret <?xml version="1.0" encoding="UTF-8"?> as an XML element which has a name of '?xml version="1.0" encoding="UTF-8"?'. However, the structure <?xml version="1.0" encoding="UTF-8"?> is not an XML element, it is an XML declaration.

And since your code seems to be attempting to split every time an XML declaration is encountered, it seems that your input is a file that contains multiple XML declarations. This file is not valid XML. The XML specification allows the XML declaration to appear at most once, and only at the very beginning of an XML document. (Don't confuse the XML declaration with a processing instruction. They look similar because they are both delimited by <? and ?>, but the XML declaration is not a processing instruction.) If you use an XML parser on your input file, and this parser conforms to the XML specification, then it has to reject your file as not being XML, because XML does not allow XML declarations to appear at random positions in documents.
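
A quick way to see this rejection is to hand a tiny concatenated input to xml.etree. This is only a sketch: the two one-element documents are invented for the example, and the exact error message depends on the underlying parser, so it is printed rather than assumed.

```python
from xml.etree import ElementTree as ET

# Two complete documents glued together (hypothetical content).
concatenated = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n<a/>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n<b/>\n'
)

try:
    ET.fromstring(concatenated)
    rejected = False
except ET.ParseError as exc:
    rejected = True
    print("parser refused the input:", exc)
```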

Where does that leave you? If all XML declarations present in your source document are the same, there's a relatively easy way to make your document parsable by an XML parser. (The attempts you made suggest that they are all the same, since you do not use a regular expression to match different forms of the XML declaration, e.g. one that would specify the standalone parameter.) You can just remove all XML declarations from your source document, wrap what remains in a new root element, and parse that with xml.etree. (This assumes that the individual XML documents that were concatenated to make up your source document were all individually well-formed. If they weren't, this won't work.)
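
On a small input, the remove-and-wrap idea could look like the sketch below. The patent and id element names and the root wrapper name are assumptions for illustration, and the whole input is held in memory here, so this shows the principle only, not a way around the memory error.

```python
from xml.etree import ElementTree as ET

# Hypothetical concatenated input: every document carries the same declaration.
raw = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<patent><id>1</id></patent>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<patent><id>2</id></patent>\n'
)

declaration = '<?xml version="1.0" encoding="UTF-8"?>'

# Remove every declaration, then wrap what is left in a single new root.
wrapped = "<root>" + raw.replace(declaration, "") + "</root>"
root = ET.fromstring(wrapped)

# Each child of the wrapper is one of the original documents.
documents = [ET.tostring(child, encoding="unicode").strip() for child in root]
print(documents)
```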

Note, however, that the string <?xml version="1.0" encoding="UTF-8"?> can appear in an XML document in contexts where this string is not actually an XML declaration. Here is a well-formed XML document that would throw off an algorithm that just looks for a string that looks like an XML declaration:

<?xml version = "1.0" encoding = "UTF-8"?>
<a>
  <![CDATA[
           <?xml version = "1.0" encoding = "UTF-8"?>
  ]]>
  <?q <?xml version = "1.0" encoding = "UTF-8"?> ?>
  <!-- <?xml version = "1.0" encoding = "UTF-8"?> -->
</a>

If you know how your source file was created, you may already be able to know for sure that you don't have any of the cases above. Otherwise, you may want to examine your source to make sure none of the above happens.
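
The sketch below runs that check on the sample document itself: it parses as a single well-formed document, yet a plain substring search reports several matches. Searching for the prefix <?xml is just one possible naive test, chosen here for illustration.

```python
from xml.etree import ElementTree as ET

document = b"""<?xml version = "1.0" encoding = "UTF-8"?>
<a>
  <![CDATA[
           <?xml version = "1.0" encoding = "UTF-8"?>
  ]]>
  <?q <?xml version = "1.0" encoding = "UTF-8"?> ?>
  <!-- <?xml version = "1.0" encoding = "UTF-8"?> -->
</a>
"""

root = ET.fromstring(document)         # one document, one root element
naive_hits = document.count(b"<?xml")  # what a plain string search would see

print(root.tag, naive_hits)
```

A splitter based on string matching would cut this document into pieces, even though only the first match is a real XML declaration.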

Once you have taken care of this, a strategy based on ET.iterparse or SAX should work.
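
As one possible shape for such a strategy, here is a sketch that uses ET.XMLPullParser, a close relative of iterparse that is fed data incrementally, so the input is never loaded in one piece. It rests on several assumptions: each declaration sits entirely on one line, none of the look-alike cases shown earlier occur, the documents contain no DOCTYPE declarations (those would not be legal inside the wrapper element), and the name iter_documents is made up for this example.

```python
import io
import re
from xml.etree import ElementTree as ET

# Assumption: a declaration never spans more than one line.
DECLARATION = re.compile(rb"<\?xml\s[^>]*\?>")

def iter_documents(stream):
    """Yield the root element of each concatenated document, one at a time."""
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed(b"<root>")  # artificial wrapper around all documents
    depth = 0
    wrapper = None
    for line in stream:
        parser.feed(DECLARATION.sub(b"", line))
        for event, element in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    wrapper = element
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the wrapper just closed: one document.
                    yield element
                    wrapper.clear()  # detach finished documents to free memory
    parser.feed(b"</root>")
    parser.close()

# Small in-memory stand-in for the large file (hypothetical content).
sample = io.BytesIO(
    b'<?xml version="1.0" encoding="UTF-8"?>\n<a><b>1</b></a>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n<c/>\n'
)
tags = [document.tag for document in iter_documents(sample)]
print(tags)
```

With a real file, the stream would be open(file_name, "rb"), and each yielded element could be written out with ET.ElementTree(element).write(path, encoding="utf-8", xml_declaration=True).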

Upvotes: 1

balderman

Reputation: 23825

If your XML cannot be loaded because of memory limits, you should consider using SAX.

With SAX you read the document in "small bites" and do whatever you want with them (for example, save every N elements to a new file).
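
A rough sketch of the "save every N elements" idea is shown below. It assumes a single well-formed document whose records are the direct children of the root. ChunkSplitter, open_chunk and the chunk wrapper element are hypothetical names invented for this example, and in-memory buffers stand in for real output files.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class ChunkSplitter(xml.sax.ContentHandler):
    """Copy every `per_chunk` children of the root into a separate output."""

    def __init__(self, per_chunk, open_chunk):
        super().__init__()
        self.per_chunk = per_chunk
        self.open_chunk = open_chunk  # callback: chunk index -> text stream
        self.depth = 0
        self.count = 0
        self.chunks = 0
        self.writer = None

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2 and self.writer is None:
            # First record of a new chunk: open an output and a wrapper element.
            self.writer = XMLGenerator(self.open_chunk(self.chunks), encoding="utf-8")
            self.writer.startDocument()
            self.writer.startElement("chunk", {})
            self.chunks += 1
        if self.depth >= 2:
            self.writer.startElement(name, attrs)

    def characters(self, content):
        if self.depth >= 2:
            self.writer.characters(content)

    def endElement(self, name):
        if self.depth >= 2:
            self.writer.endElement(name)
        if self.depth == 2:
            self.count += 1
            if self.count % self.per_chunk == 0:
                self._finish()
        self.depth -= 1
        if self.depth == 0 and self.writer is not None:
            self._finish()  # close a final, partially filled chunk

    def _finish(self):
        self.writer.endElement("chunk")
        self.writer.endDocument()
        self.writer = None

outputs = []

def open_chunk(index):
    # Stand-in for open(f"{index}.xml", "w", encoding="utf-8").
    buffer = io.StringIO()
    outputs.append(buffer)
    return buffer

xml.sax.parseString(b"<root><r>1</r><r>2</r><r>3</r></root>", ChunkSplitter(2, open_chunk))
print(len(outputs))
```

For a file on disk, xml.sax.parse(file_name, handler) takes the place of parseString, so the input is processed as a stream.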

Python SAX example 1.

Python SAX example 2.

Upvotes: 2
