Samer A.

Reputation: 65

How to sort XML by node value in Python

I'm running on Ubuntu 18.04. A Python 2 or 3 solution would be preferred. I've got XML structured like so:

<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>

But I've got a lot of it: a single 10GB file's worth. I'm looking for the most efficient way to sort the records on tstamp, so that the output looks like this:

<Records>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
</Records>

Thanks in advance.

Upvotes: 0

Views: 138

Answers (1)

balderman

Reputation: 23825

Below is code that sorts the records by 'tstamp':

import datetime
import xml.etree.ElementTree as ET

xml = '''<Records>
  <Record>
    <recID>123</recID>
    <tstamp>2018-12-31T23:59:42.38Z</tstamp>
  </Record>
  <Record>
    <recID>456</recID>
    <tstamp>2018-10-10T12:03:02.28Z</tstamp>
  </Record>
  <Record>
    <recID>99</recID>
    <tstamp>1999-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>88</recID>
    <tstamp>2020-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>789</recID>
    <tstamp>2018-11-11T13:50:00.00Z</tstamp>
  </Record>
  <Record>
    <recID>11</recID>
    <tstamp>2012-11-11T13:50:00.00Z</tstamp>
  </Record>
</Records>'''
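# Parse the document and collect every <Record> element.
root = ET.fromstring(xml)
records = root.findall('.//Record')
# Sort on the full timestamp, including fractional seconds (%f accepts 1 to 6 digits).
records = sorted(records, key=lambda r: datetime.datetime.strptime(r.find('tstamp').text, '%Y-%m-%dT%H:%M:%S.%fZ'))
for r in records:
    print(f'{r.find("tstamp").text} -- {r.find("recID").text}')
# Build a new root that holds the records in sorted order and write it out.
root = ET.Element('Records')
root.extend(records)

ET.ElementTree(root).write('output.xml')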

output

1999-11-11T13:50:00.00Z -- 99
2012-11-11T13:50:00.00Z -- 11
2018-10-10T12:03:02.28Z -- 456
2018-11-11T13:50:00.00Z -- 789
2018-12-31T23:59:42.38Z -- 123
2020-11-11T13:50:00.00Z -- 88
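
Note that the code above loads the whole document into memory, which is unlikely to work for a 10GB file. One possible way around that is an external merge sort: stream the records with ET.iterparse, sort them in chunks, spill each chunk to a temporary file, and merge the chunks with heapq.merge. The following is only a rough sketch (Python 3), not a tuned solution. The function name sort_records and the chunk_size parameter are made up for illustration. It also assumes that every tstamp uses the same fixed-width UTC format as in the question, so that comparing the strings gives chronological order, and that every Record is a direct child of the root.

```python
import heapq
import os
import tempfile
import xml.etree.ElementTree as ET


def sort_records(src_path, dst_path, chunk_size=1000000):
    """Sort <Record> elements by <tstamp> without loading the whole file."""
    chunk_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk by timestamp and spill it to a temp file,
        # one "timestamp<TAB>record" pair per line.
        chunk.sort(key=lambda pair: pair[0])
        fd, path = tempfile.mkstemp(suffix='.chunk')
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            for key, payload in chunk:
                f.write(f'{key}\t{payload}\n')
        chunk_paths.append(path)
        chunk.clear()

    # Stream the input; the first event is the start of the root element.
    context = ET.iterparse(src_path, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event != 'end' or elem.tag != 'Record':
            continue
        key = elem.find('tstamp').text
        # Drop indentation so the record serializes compactly.
        elem.text = None
        elem.tail = None
        for child in elem:
            child.tail = None
        payload = ET.tostring(elem, encoding='unicode')
        # Keep each record on one line; character references survive re-parsing.
        payload = payload.replace('\r', '&#13;').replace('\n', '&#10;')
        chunk.append((key, payload))
        root.clear()  # release records that have already been processed
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    def read_chunk(path):
        # Yield (timestamp, record) pairs from one sorted chunk file.
        with open(path, encoding='utf-8') as f:
            for line in f:
                key, _, payload = line.rstrip('\n').partition('\t')
                yield key, payload

    # Merge the sorted chunks lazily and write the final document.
    with open(dst_path, 'w', encoding='utf-8') as out:
        out.write('<Records>\n')
        merged = heapq.merge(*(read_chunk(p) for p in chunk_paths),
                             key=lambda pair: pair[0])
        for _, payload in merged:
            out.write(f'  {payload}\n')
        out.write('</Records>\n')
    for p in chunk_paths:
        os.remove(p)
```

It would be called as sort_records('input.xml', 'output.xml'). Each record ends up on a single line in the output. All chunk files are open at the same time during the merge, so chunk_size should be large enough to keep their number below the open-file limit while a single chunk still fits in memory.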

Upvotes: 1
