Thang Pham

Reputation: 38705

Java: Given a List of file names, make sure the corresponding XML only contains information about these files

I have a List of files (20,000 to 50,000 files) and a large XML file. I want the XML file to contain information only about the files in the List.

For example, let's say we have only file XYZ in our List, and the XML file looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<index>
<document>
    <entry number="1">
        <commentfield>
            <name>FileName</name>
            <value>XYZ</value>
        </commentfield>
    </entry>
    <entry number="2">
        <commentfield>
            <name>Note</name>
            <value>03-000</value>
        </commentfield>
    </entry>
</document>
<document>
    <entry number="1">
        <commentfield>
            <name>FileName</name>
            <value>ABC</value>
        </commentfield>
    </entry>
</document>
...
</index>

The XML contains information about two files, XYZ and ABC. I do not want the final XML to contain the last <document> ... ABC ... </document>, because ABC is not in our List. I already have this working in a KSH script, but it runs too slowly (over 4 hours for 22,000 files, although it also does some other work), so I decided to port it to Java for better performance.

What I have done so far is read the file line by line into a String. When I hit </document>, I parse out the name of the file and check whether that file exists in our List. If it does, I write the whole <document> ... </document> to another XML file, then read the next <document>. Is there a better way?

I have since been able to write code that accomplishes this with a DOM parser. The code is long, so if you need it, please PM me. Thank you very much for your help.

Upvotes: 0

Views: 295

Answers (4)

user177800


There are multiple ways to approach this:

XSLT would make this very simple. If you have a fixed input list, you can write a transform that selects only the valid elements and outputs them. This way you don't actually have to write any code, and you can use something like xsltproc, which is very fast!

This is what I would try first, because XSLT was created specifically for transforming XML into other XML. It is less code, and less code means less maintenance.

Here is an idea of how to get started. This outputs all the <document/> elements except those that contain a <value/> element equal to ABC.

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml"/>

    <!-- this matches ALL nodes and ALL attributes -->
    <xsl:template match="node()|@*">
      <xsl:copy>
         <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
    </xsl:template>

    <!-- this matches the entire document element where value = 'ABC' -->
    <xsl:template match="document[entry[commentfield[value[(text()='ABC')]]]]"/>

</xsl:stylesheet>

There are plenty of resources and good books on XSLT. All you need to do is provide a whitelist of allowed <value/> elements and reverse the logic in my example.
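One way to read "reverse the logic" is sketched below in Java, using the JDK's built-in javax.xml.transform API to apply the stylesheet. The class name XsltWhitelistFilter and the hard-coded single name 'XYZ' are assumptions made for illustration only. A real whitelist of tens of thousands of names would need a different lookup mechanism, which this sketch does not attempt.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies a "keep only XYZ" stylesheet to an XML stream.
final class XsltWhitelistFilter {

    // Identity transform plus one empty template that drops every <document>
    // whose FileName value is NOT 'XYZ' (the reverse of the blacklist example).
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml'/>"
        + "<xsl:template match='node()|@*'>"
        + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"document[not(entry/commentfield[name='FileName']/value='XYZ')]\"/>"
        + "</xsl:stylesheet>";

    static void filter(Reader in, Writer out) throws TransformerException {
        // Compile the stylesheet, then run it over the input.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        transformer.transform(new StreamSource(in), new StreamResult(out));
    }
}
```

The predicate also checks that the sibling <name> is FileName, so a Note value that happens to equal a file name does not keep a document by accident.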

If you have an .xsd, or can create one (your input file doesn't look very complicated), you can use JAXB to generate an object hierarchy automatically. Use it to parse the input file, walk the resulting object graph, remove anything that doesn't meet your criteria, and marshal the result back to a file.

JAXB isn't really viable if the file is larger than what will fit into memory.

Upvotes: 1

Aron

Reputation: 1642

'Parsing' an XML input yourself using regex or whatever is a brittle solution that will place unnecessary restrictions on the format of the input text (around whitespace and such). There's no need for it when the Java library comes with several XML parsers.

Using DOM might be the easiest way to go, if you can guarantee that your input XML won't grow too large to slurp into memory at once. You can:

  1. Read the XML into a DOM structure
  2. Traverse the DOM and modify it, removing the unwanted nodes
  3. Write the modified DOM to a new file using a Transformer. Example here.
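The three steps above can be sketched roughly as follows, using only the JDK's DOM and Transformer classes. The class name DomDocumentFilter and the helper fileNameOf are made up for this sketch, and it assumes the element layout shown in the question (a commentfield whose name is FileName holds the file name in its value).

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper illustrating the read / modify / write steps.
final class DomDocumentFilter {

    static void filter(InputStream in, OutputStream out, Set<String> wanted) throws Exception {
        // Step 1: read the whole XML into a DOM tree (needs enough memory for the file).
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);

        // Step 2: the NodeList is live, so iterate backwards to keep the
        // remaining indexes valid while elements are removed.
        NodeList documents = dom.getElementsByTagName("document");
        for (int i = documents.getLength() - 1; i >= 0; i--) {
            Element document = (Element) documents.item(i);
            String fileName = fileNameOf(document);
            if (fileName == null || !wanted.contains(fileName)) {
                document.getParentNode().removeChild(document);
            }
        }

        // Step 3: write the modified DOM with an identity Transformer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new DOMSource(dom), new StreamResult(out));
    }

    // Returns the <value> paired with <name>FileName</name>, or null if there is none.
    private static String fileNameOf(Element document) {
        NodeList fields = document.getElementsByTagName("commentfield");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            NodeList names = field.getElementsByTagName("name");
            NodeList values = field.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && "FileName".equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }
}
```

Whitespace text nodes around removed elements are left in place, so the output may contain blank lines where documents used to be.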

A more efficient option might be StAX, which doesn't require the entire input to be read in at once. I haven't used it, but it has the ability to read as well as write documents. You could read a <document> element at a time, and write it back to an output file if it's in the list. A bit of a tutorial here.
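A minimal sketch of that StAX idea, using the event API from javax.xml.stream: events belonging to one <document> are buffered, and the buffer is written out only if the document's FileName value is in the set. The class name StaxDocumentFilter is invented here, and the element names are taken from the question's sample, so treat this as an illustration under those assumptions rather than a finished tool.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the input and copies only the wanted <document> elements.
final class StaxDocumentFilter {

    static void filter(Reader in, Writer out, Set<String> wanted) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Deliver each run of text as a single Characters event.
        inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = inputFactory.createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> buffer = null; // non-null only while inside a <document>
        String currentTag = "";       // local name of the element whose text is being read
        String lastName = "";         // text of the most recent <name>
        boolean keep = false;         // true once a wanted FileName value is seen

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                currentTag = event.asStartElement().getName().getLocalPart();
                if (currentTag.equals("document")) {
                    buffer = new ArrayList<>();
                    lastName = "";
                    keep = false;
                }
            } else if (event.isEndElement()) {
                currentTag = ""; // ignore whitespace between elements
            } else if (event.isCharacters() && buffer != null) {
                String text = event.asCharacters().getData().trim();
                if (currentTag.equals("name")) {
                    lastName = text;
                } else if (currentTag.equals("value")
                        && lastName.equals("FileName") && wanted.contains(text)) {
                    keep = true;
                }
            }

            if (buffer == null) {
                writer.add(event); // outside any <document>: copy straight through
            } else {
                buffer.add(event);
                if (event.isEndElement()
                        && event.asEndElement().getName().getLocalPart().equals("document")) {
                    if (keep) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered);
                        }
                    }
                    buffer = null; // the document is finished either way
                }
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

Only one <document> is held in memory at a time, so the size of the whole file does not matter. When using this with files, open the Reader and Writer with the charset named in the XML declaration (ISO-8859-1 in the question's sample).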

Upvotes: 2

Aron

Reputation: 1642

Ignoring, for the moment, details of the best way to parse and re-write the XML, the basic strategy of reading once through the XML file and looking for each file name in the list seems sound.

However, you might be able to improve the way you check for presence in the list of filenames (you don't specify how you're doing that). A couple of possibilities:

  1. Put the filenames in a Set, and check for presence in the set, which will be an O(1) operation on average for a HashSet or O(log N) for a TreeSet
  2. Sort the list of filenames and perform a binary search, which will be an O(log N) operation.

Either way would be an improvement over a simple linear search through an unsorted list.
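Both possibilities can be sketched in a few lines of plain Java. The class and method names here (FileNameLookup, toSet, toSortedList, inSortedList) are made up for the illustration; List.of needs Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper showing the two lookup strategies.
final class FileNameLookup {

    // Option 1: copy the names into a HashSet; contains() is O(1) on average.
    static Set<String> toSet(List<String> fileNames) {
        return new HashSet<>(fileNames);
    }

    // Option 2: sort a copy once up front so it can be binary-searched later.
    static List<String> toSortedList(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        Collections.sort(sorted);
        return sorted;
    }

    // Binary search on the sorted copy; each lookup is O(log N).
    static boolean inSortedList(List<String> sortedNames, String name) {
        return Collections.binarySearch(sortedNames, name) >= 0;
    }

    public static void main(String[] args) {
        List<String> fileNames = List.of("XYZ", "DEF");
        System.out.println(toSet(fileNames).contains("XYZ"));
        System.out.println(inSortedList(toSortedList(fileNames), "ABC"));
    }
}
```

For a one-off membership check against 20,000 to 50,000 names, the HashSet is usually the simpler choice, since it needs no sorting step.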

Upvotes: 1

Mike Milkin

Reputation: 4279

You can use XPath to select the elements; if you know the structure of the XML, you can then remove those elements. Depending on how you are processing your XML, you can use DOM (probably not a good idea for large XML files).
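As a rough illustration of the XPath idea on top of a DOM tree, the sketch below uses the JDK's javax.xml.xpath package. The class name XPathDocumentFilter and the two XPath expressions are assumptions based on the structure shown in the question, not something given in this answer.

```java
import java.util.Set;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: selects <document> elements with XPath and removes unlisted ones.
final class XPathDocumentFilter {

    static void removeUnlisted(Document dom, Set<String> wanted) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Compile once; the second expression is evaluated relative to each <document>.
        XPathExpression allDocuments = xpath.compile("/index/document");
        XPathExpression fileNameOf = xpath.compile("entry/commentfield[name='FileName']/value");

        // The NODESET result is a snapshot, so removing nodes while looping is safe.
        NodeList documents = (NodeList) allDocuments.evaluate(dom, XPathConstants.NODESET);
        for (int i = 0; i < documents.getLength(); i++) {
            Node document = documents.item(i);
            // Evaluates to the empty string when no FileName value is present.
            String fileName = fileNameOf.evaluate(document).trim();
            if (!wanted.contains(fileName)) {
                document.getParentNode().removeChild(document);
            }
        }
    }
}
```

The Document passed in still has to be parsed into memory first, so the same size caveat as for DOM applies.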

Upvotes: 0
