Atihska

Reputation: 5126

Splitting large xml files in n groups

I have a large XML file whose parent tag has 97k child tags. I want to split it into 10 files, each with 10k tags, with the last file holding the remainder.

I have code that writes one child tag to each file, but I can't work out how to write them in groups.

So assume my sample XML has 10 child tags and I want to create 5 files, each with 2 child tags.

My sample xml:

<root>
    <row>
        <NAME>A</NAME>
        <FIRSTNAME>A</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
    <row>
        <NAME>B</NAME>
        <FIRSTNAME>B</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
<row>
        <NAME>A</NAME>
        <FIRSTNAME>A</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
    <row>
        <NAME>B</NAME>
        <FIRSTNAME>B</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
<row>
        <NAME>A</NAME>
        <FIRSTNAME>A</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
    <row>
        <NAME>B</NAME>
        <FIRSTNAME>B</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
<row>
        <NAME>A</NAME>
        <FIRSTNAME>A</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
    <row>
        <NAME>B</NAME>
        <FIRSTNAME>B</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
<row>
        <NAME>A</NAME>
        <FIRSTNAME>A</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
    <row>
        <NAME>B</NAME>
        <FIRSTNAME>B</FIRSTNAME>
        <GENDER>M</GENDER>
    </row>
</root>

And my result should be 5 files, each having 2 entries as follows:

<root>
        <row>
            <NAME>A</NAME>
            <FIRSTNAME>A</FIRSTNAME>
            <GENDER>M</GENDER>
        </row>
        <row>
            <NAME>B</NAME>
            <FIRSTNAME>B</FIRSTNAME>
            <GENDER>M</GENDER>
        </row>
</root>

The code below puts each child tag in its own file, but I want, for example, 2 tags per file.

import xml.etree.ElementTree as ET
context = ET.iterparse('file.xml', events=('end', ))
index = 0
for event, elem in context:
    if elem.tag == 'row':
        index += 1
        filename = format(str(index) + ".xml")
        with open(filename, 'wb') as f:
            # The file is opened in binary mode, so write bytes rather than str
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem))

Thanks in advance!

EDIT to add recipes:

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

Upvotes: 0

Views: 399

Answers (1)

abarnert

Reputation: 366073

You have an iterable of (event, element) pairs:

context = ET.iterparse('file.xml', events=('end', ))

Now, you want to filter this down to just the row elements:

rows = (elem for event, elem in context if elem.tag == 'row')

Now you want to group them. Use the grouper recipe from the itertools docs:

groups = grouper(rows, 2)

You can obviously change that 2 to 1000 or whatever once you get things working and want to run it for real.
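
To see what grouper hands back before pointing it at real XML, here is a small sketch that runs it on five stand-in strings (the strings are placeholders, not real elements). The recipe is copied from the itertools docs as quoted in the question.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Five stand-in items grouped in pairs: the last group is padded with None.
pairs = list(grouper(["r1", "r2", "r3", "r4", "r5"], 2))
print(pairs)  # [('r1', 'r2'), ('r3', 'r4'), ('r5', None)]
```

Each group is a tuple of exactly n items, which is why the padding shows up whenever the input length is not a multiple of n.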

Now, you can just iterate the groups. While we're at it, let's use enumerate so you don't need that manual index += 1 stuff. Also, instead of building a string manually and then pointlessly calling format on it, let's just use an f-string.

for index, group in enumerate(groups):
    # If you need to run on 3.5 or 2.7, use "{}.xml".format(index)
    filename = f"{index}.xml"
    with open(filename, 'wb') as f:
        # The file is opened in binary mode, so the header must be bytes
        f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")

… then iterate the elements within the group. Be careful, though: if the number of elements is not a multiple of the group size (for example, an odd number with groups of 2), grouper will fill in the incomplete last group with None values (see note 1).

        for elem in group:
            # Test against None explicitly; an Element with no children is falsy
            if elem is not None:
                f.write(ET.tostring(elem))
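
Putting the steps together, one possible end-to-end version might look like the sketch below. The function name split_rows and its out_dir parameter are made up for illustration, and wrapping each group in a root tag is an assumption based on the desired output shown in the question.

```python
import os
import xml.etree.ElementTree as ET
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Recipe from the itertools docs: fixed-length chunks, padded at the end."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def split_rows(source, group_size, out_dir="."):
    """Write each run of `group_size` <row> elements to its own XML file.

    `source` is a filename or a binary file object. Hypothetical helper,
    returns the list of paths it wrote.
    """
    context = ET.iterparse(source, events=("end",))
    rows = (elem for event, elem in context if elem.tag == "row")
    written = []
    for index, group in enumerate(grouper(rows, group_size)):
        filename = os.path.join(out_dir, f"{index}.xml")
        with open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(b"<root>\n")  # assumed wrapper, matching the question
            for elem in group:
                if elem is not None:  # skip the None padding in the last group
                    f.write(ET.tostring(elem))
            f.write(b"</root>\n")
        written.append(filename)
    return written

# Example call (assumes file.xml exists in the current directory):
# split_rows("file.xml", 2)
```

Note that enumerate starts at 0, so the files are named 0.xml, 1.xml, and so on. For a very large input, calling elem.clear() after writing each row is a common way to keep memory use down.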

1. This isn't that hard to change, but I'm using the recipe directly out of the docs so I don't have to explain how to change it.
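
If the None padding is unwanted, one way to change the approach is to slice the iterator instead of zipping it. This is a sketch of an alternative, not the docs recipe; the name chunked is made up here.

```python
from itertools import islice

def chunked(iterable, n):
    """Yield lists of up to n items; the last list may be shorter."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

print(list(chunked("ABCDEFG", 3)))  # [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
```

With this in place of grouper, the None check in the inner loop is no longer needed. On Python 3.12 and later, itertools.batched does much the same thing and yields tuples.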

Upvotes: 1
