Reputation: 5126
I have a large XML file whose parent tag has 97k child tags. I want to split it into 10 files: nine with 10k tags each and a last one with the remainder.
I have code that writes one child tag to each file, but I can't work out how to write them in groups.
So assume my sample XML has 10 child tags and I want to create 5 files, each with 2 child tags.
My sample xml:
<root>
<row>
<NAME>A</NAME>
<FIRSTNAME>A</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>B</NAME>
<FIRSTNAME>B</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>A</NAME>
<FIRSTNAME>A</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>B</NAME>
<FIRSTNAME>B</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>A</NAME>
<FIRSTNAME>A</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>B</NAME>
<FIRSTNAME>B</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>A</NAME>
<FIRSTNAME>A</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>B</NAME>
<FIRSTNAME>B</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>A</NAME>
<FIRSTNAME>A</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>B</NAME>
<FIRSTNAME>B</FIRSTNAME>
<GENDER>M</GENDER>
</row>
</root>
And my result should be 5 files, each having 2 entries as follows:
<root>
<row>
<NAME>A</NAME>
<FIRSTNAME>A</FIRSTNAME>
<GENDER>M</GENDER>
</row>
<row>
<NAME>B</NAME>
<FIRSTNAME>B</FIRSTNAME>
<GENDER>M</GENDER>
</row>
</root>
The code below writes one child tag per file, but I want, for example, 2 tags per file.
import xml.etree.ElementTree as ET

context = ET.iterparse('file.xml', events=('end', ))
index = 0
for event, elem in context:
    if elem.tag == 'row':
        index += 1
        filename = format(str(index) + ".xml")
        with open(filename, 'wb') as f:
            # The file is opened in binary mode, so write bytes, not str.
            f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
            f.write(ET.tostring(elem))
Thanks in advance!
EDIT to add recipes:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
Upvotes: 0
Views: 399
Reputation: 366073
You have an iterable of (event, element) pairs:
context = ET.iterparse('file.xml', events=('end', ))
Now, you want to filter this down to just the row elements:
rows = (elem for event, elem in context if elem.tag == 'row')
Now you want to group them. Use the grouper recipe from the itertools docs:
groups = grouper(rows, 2)
You can obviously change that 2 to 1000 or whatever once you get things working and want to run it for real.
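To see what grouper hands back, here is a small sketch that runs the recipe on plain strings standing in for the parsed row elements (the list `rows` below is made-up sample data, not part of the original code). Note how the last group is padded with the fill value when the input does not divide evenly:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks, padding the last one."""
    # The same iterator repeated n times, so zip_longest pulls n items per tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Stand-in data: five strings instead of parsed <row> elements.
rows = ["r1", "r2", "r3", "r4", "r5"]
groups = list(grouper(rows, 2))
print(groups)
# [('r1', 'r2'), ('r3', 'r4'), ('r5', None)]
```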
Now, you can just iterate the groups. While we're at it, let's use enumerate so you don't need that manual index += 1 stuff. Also, instead of building a string manually and then pointlessly calling format on it, let's just use an f-string.
for index, group in enumerate(groups):
    # If you need to run on 3.5 or 2.7, use "{}.xml".format(index)
    filename = f"{index}.xml"
    with open(filename, 'wb') as f:
        # Binary mode, so the declaration must be bytes, not str
        f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
… then iterate the elements within the group. But be careful: if the number of elements isn't a multiple of the group size (an odd number, with groups of 2), grouper will fill in the incomplete last group with None values.[1]
        for elem in group:
            # Compare with None explicitly: an Element with no children is falsy
            if elem is not None:
                f.write(ET.tostring(elem))
1. This isn't that hard to change, but I'm using the recipe directly out of the docs so I don't have to explain how to change it.
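For what it's worth, here is one way the change mentioned in the footnote might look, put together into a complete script. This is only a sketch under some assumptions: `chunked` and `split_rows` are hypothetical helper names (not from the itertools docs or from the code above), `chunked` uses itertools.islice so the last group is simply shorter instead of padded with None, and each output file is wrapped in a `<root>` element because the desired output in the question shows one.

```python
import xml.etree.ElementTree as ET
from itertools import islice

def chunked(iterable, n):
    """Yield lists of up to n items; the last list may be shorter (no padding)."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

def split_rows(source, per_file, prefix=""):
    """Write each run of per_file <row> elements from source to its own file.

    Returns the list of file names written. The prefix parameter is an
    illustrative addition for choosing where the output goes.
    """
    context = ET.iterparse(source, events=("end",))
    rows = (elem for event, elem in context if elem.tag == "row")
    filenames = []
    for index, group in enumerate(chunked(rows, per_file)):
        filename = f"{prefix}{index}.xml"
        with open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write(b"<root>\n")
            for elem in group:
                f.write(ET.tostring(elem))
                # Optional: drop the row's children once written, which
                # helps keep memory down on very large inputs.
                elem.clear()
            f.write(b"</root>\n")
        filenames.append(filename)
    return filenames
```

With this in place, something like `split_rows("file.xml", 10000)` should produce `0.xml`, `1.xml`, and so on, with the last file holding whatever rows are left over.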
Upvotes: 1