Reputation: 453

Tool to find duplicate sections in a text (XML) file?

I have an XML file, and I want to find nodes that have duplicate CDATA. Are there any tools that can help me do this?

I'd be fine with a tool that does this generally for text documents.

Upvotes: 1

Answers (5)

tephyr

Reputation: 1051

A very similar question (asked a year after this one) has some answers with very good tools for diffing chunks within the same file, including Atomiq.

Upvotes: 0

bortzmeyer

Reputation: 35519

Here is a first attempt, written in Python and using only the standard library. You can improve it in many ways (trimming leading and trailing whitespace, computing a hash of the text to reduce memory requirements, displaying the elements better, with their line numbers, etc.):

import sys
import xml.etree.ElementTree as ElementTree

def print_elem(element):
    # Render an element as its opening tag, e.g. "<bar>"
    return "<%s>" % element.tag

if len(sys.argv) != 2:
    print("Usage: %s filename" % sys.argv[0], file=sys.stderr)
    sys.exit(1)
filename = sys.argv[1]
tree = ElementTree.parse(filename)
root = tree.getroot()
# Map each text value to the list of elements that contain it
chunks = {}
for element in root.findall('.//*'):
    chunks.setdefault(element.text, []).append(element)
# Report every text value that is shared by more than one element
for text, elements in chunks.items():
    if len(elements) > 1:
        print("\"%s\" is a duplicate: found in %s" %
              (text, [print_elem(e) for e in elements]))

If you give it this XML file:

<foo>
<bar>Hop</bar><quiz>Gaw</quiz>
<sub>
<und>Hop</und>
</sub>
</foo>

it will output:

"Hop" is a duplicate: found in ['<bar>', '<und>']
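Some of the improvements listed above (trimming whitespace and keying on a hash instead of the full text) could look roughly like the sketch below. The function name find_duplicates and the choice of SHA-1 are my own assumptions, not part of the script above; it also skips empty elements and takes the XML as a string rather than a filename. ElementTree exposes CDATA content as ordinary element text, so CDATA sections are compared the same way.

```python
import hashlib
import xml.etree.ElementTree as ElementTree
from collections import defaultdict

def find_duplicates(xml_source):
    """Map a digest of each stripped text value to the tags that hold it.

    Hypothetical helper: only groups with more than one element are returned.
    """
    root = ElementTree.fromstring(xml_source)
    groups = defaultdict(list)
    for element in root.iter():
        text = (element.text or "").strip()
        if not text:
            continue  # skip empty and whitespace-only elements
        # Keep a fixed-size digest as the key instead of the full text
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[digest].append(element.tag)
    return {digest: tags for digest, tags in groups.items() if len(tags) > 1}
```

Keying on a digest only saves memory when the text chunks are long, and you lose the ability to print the duplicated text itself unless you look it up again afterwards.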

Upvotes: 2

cjk

Reputation: 46465

You could write a simple C# app that uses LINQ to read all the nodes twice as separate entities, then finds all values that are equal.

Upvotes: 0

Stephen Friederichs

Reputation: 1059

Not easily. My first thought is XSLT, but it's hard to implement. You'd have to go through each node and then do an XPath select on every node with the same data. That would find them, but you'd end up processing all of the nodes with the same data again later as well (i.e., there is no way to keep track of which node data you've already processed and ignore it). You could do it with a real programming language, but that's outside of my experience.

Upvotes: 0

lImbus

Reputation: 1588

I've never heard of anything like that, but it might be an interesting task to write such a program based on a dictionary coder as used in archivers.
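For plain text documents, a much simpler relative of that idea might be enough as a starting point. The sketch below is not a real dictionary coder; it only borrows the notion of a dictionary of previously seen phrases, where a "phrase" is a run of consecutive lines. The name duplicate_blocks and the window parameter are hypothetical.

```python
from collections import defaultdict

def duplicate_blocks(lines, window=3):
    """Find runs of `window` consecutive lines that occur more than once.

    Returns {block: [1-based start line numbers]}, where each block is a
    tuple of whitespace-stripped lines.
    """
    stripped = [line.strip() for line in lines]
    seen = defaultdict(list)
    for start in range(len(stripped) - window + 1):
        block = tuple(stripped[start:start + window])
        if not any(block):
            continue  # ignore windows made only of blank lines
        seen[block].append(start + 1)
    return {block: starts for block, starts in seen.items() if len(starts) > 1}
```

You would call it with something like duplicate_blocks(open(filename).read().splitlines()). A fixed window reports a long repeated section as several overlapping blocks, so merging adjacent hits would be the next thing to add.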

Upvotes: 0
