Stephen

Reputation: 6319

Remove duplicate nodes in an XML

I'm generating an XML file using Python and markup.py. It was all working, but after recent changes to the script I'm now getting duplicated values in the nodes because of the checks I put in place. Here's a sample of the output (these are vehicle records):

<?xml version='1.0' encoding='UTF-8' ?>
<datafeed>
<vehicle>
    <vin>2HNYD18816H532105</vin>
    <features>
        <feature>AM/FM Radio</feature>
        <feature>Air Conditioning</feature>
        <feature>Anti-Lock Brakes (ABS)</feature>
        <feature>Alarm</feature>
        <feature>CD Player</feature>
        <feature>Air Bags</feature>
        <feature>Air Bags</feature>
        <feature>Anti-Lock Brakes (ABS)</feature>
        <feature>Alarm</feature>
        <feature>Air Bags</feature>
        <feature>Alarm</feature>
        <feature>Air Bags</feature>
    </features>
</vehicle>
<vehicle>
    <vin>2HKYF18746H537006</vin>
    <features>
        <feature>AM/FM Radio</feature>
        <feature>Anti-Lock Brakes (ABS)</feature>
        <feature>Air Bags</feature>
        <feature>Air Bags</feature>
        <feature>Anti-Lock Brakes (ABS)</feature>
        <feature>Alarm</feature>
        <feature>Air Bags</feature>
        <feature>Alarm</feature>
    </features>
</vehicle>
</datafeed>

This is a small excerpt from a larger XML file with over 100 records. What can I do to remove the duplicate nodes?

Upvotes: 1

Views: 1901

Answers (1)

poke

Reputation: 388023

There are no real "duplicates" in XML; every node is distinct by definition. But I understand that you want to get rid of the features that count as duplicates in your interpretation.

You can do this by simply parsing that tree, putting the features (the values of the nodes) in a set (to get rid of duplicates) and writing out a new XML document.
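A minimal sketch of that post-processing approach, assuming the standard library's xml.etree.ElementTree is acceptable and that the element names match the sample in the question (datafeed, vehicle, features, feature). The function name dedupe_features is made up for illustration. Instead of rebuilding the document from a set, it uses a set to track what has already been seen and removes the repeats in place, which keeps the first occurrence of each feature in its original position:

```python
import xml.etree.ElementTree as ET


def dedupe_features(xml_text):
    """Return the XML with repeated <feature> values removed per <features> block.

    dedupe_features is a hypothetical helper, not part of markup.py.
    """
    root = ET.fromstring(xml_text)
    for features in root.iter("features"):
        seen = set()
        # Iterate over a copy, because elements are removed during the loop.
        for feature in list(features):
            if feature.text in seen:
                features.remove(feature)
            else:
                seen.add(feature.text)
    return ET.tostring(root, encoding="unicode")


sample = (
    "<datafeed><vehicle><vin>2HNYD18816H532105</vin><features>"
    "<feature>Alarm</feature><feature>Air Bags</feature>"
    "<feature>Alarm</feature><feature>Air Bags</feature>"
    "</features></vehicle></datafeed>"
)
print(dedupe_features(sample))
```

Because each features element gets its own seen set, the same feature can still appear once in every vehicle. Whitespace between the removed elements is dropped along with them, so the indentation of the result may differ slightly from the input.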

Given that you are generating the file with Python, though, you should modify the creation routine so that it doesn't generate duplicate values in the first place. It would help if you told us what markup.py is or does.

edit

I just took a quick look at the markup script, so something like this might appear in your script:

# well, this might come from somewhere else, but I guess you have such a list somewhere
features = [ 'AM/FM Radio', 'Air Conditioning', 'Anti-Lock Brakes (ABS)', 'Alarm', 'CD Player', 'Air Bags', 'Air Bags', 'Anti-Lock Brakes (ABS)', 'Alarm', 'Air Bags', 'Alarm', 'Air Bags' ]

# write the XML
markup.features.open()
markup.feature( features )
markup.features.close()

In this case, just make features a set before passing it to the markup script (note that a set does not keep the original order of the features):

# write the XML
markup.features.open()
markup.feature( set( features ) )
markup.features.close()

If you have multiple separate lists that contain your features for a single vehicle, combine those lists (or sets) first:

list1 = [...]
list2 = [...]
list3 = [...]
features = set( list1 + list2 + list3 )
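If the order of the features matters in the output, a set will not keep it. One possible alternative, sketched here under the assumption that you are on Python 3.7 or later (where dicts preserve insertion order), is to deduplicate with dict.fromkeys and pass the resulting list on. The helper name unique_in_order is hypothetical:

```python
def unique_in_order(*feature_lists):
    """Combine several lists and drop repeats, keeping first-seen order.

    unique_in_order is a hypothetical helper, not part of markup.py.
    """
    combined = [feature for lst in feature_lists for feature in lst]
    # dict keys are unique and, from Python 3.7 on, keep insertion order.
    return list(dict.fromkeys(combined))


features = unique_in_order(
    ['AM/FM Radio', 'Air Bags', 'Alarm'],
    ['Air Bags', 'Anti-Lock Brakes (ABS)'],
    ['Alarm', 'Air Bags'],
)
print(features)
```

The result is a plain list, the same type as the original features list, so it can be handed to the markup call in the same way.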

Upvotes: 1
