theexplorer

Reputation: 359

Creating an XML based on another XML in Java

I'd like to take a heavily structured XML file, about half a gigabyte in size, and create from it another XML file containing only selected elements of the original.

1) How can I do that?

2) Can it be done with a DOM parser? What is the size limit of the DOM parser?

Thanks!

Upvotes: 2

Views: 2938

Answers (2)

Michael Kay

Reputation: 163322

500 MB is well within the limits of what can be achieved using XSLT. It depends a little on how much effort you want to spend developing an optimal solution: which is more expensive, your time or the machine's time?
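As a rough illustration of the XSLT route from Java, here is a minimal sketch using the standard `javax.xml.transform` API. It is not necessarily the setup the answer has in mind: the `catalog`, `item` and `type` element names and the `XsltFilterSketch` class are hypothetical, and the input is a tiny in-memory string. The JDK's built-in XSLT 1.0 processor typically holds the whole input in memory, so for a file of this size the heap would have to be sized accordingly.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltFilterSketch {

    // Hypothetical stylesheet: keeps only <item> elements whose <type> is "book".
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "  <xsl:template match='/catalog'>"
        + "    <catalog><xsl:copy-of select=\"item[type='book']\"/></catalog>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the result as a string.
    static String filter(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

For real files, `StreamSource` and `StreamResult` also accept a `java.io.File`, so the same call can read the source file and write the new one directly.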

Upvotes: 1

helderdarocha

Reputation: 23637

If you have a very large source XML (like your 0.5 GB file) and wish to extract information from it, possibly creating a new XML, consider using an event-based parser, which does not require loading the entire XML into memory. The simplest of these is the SAX parser. You write an event listener that captures events such as document start, element start and element end, inspect the data you are reading (the name of the element, its attributes, etc.) and decide whether to ignore it or do something with it.

Search for a SAX tutorial using JAXP and you should find several examples. Another strategy you might want to consider, depending on what you want to do, is StAX.
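To give an idea of what the StAX route might look like, here is a minimal sketch that copies an XML stream to a writer and drops the records that do not match. It assumes a hypothetical layout of `Movie` records with a `Director` child, and the `StaxFilterSketch` class name is made up for illustration. Only one record is buffered at a time, so memory use should not grow with the size of the file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class StaxFilterSketch {

    // Copies the document, keeping only <Movie> records whose <Director> contains the given text.
    static String filter(String xml, String director) {
        try {
            XMLInputFactory inputFactory = XMLInputFactory.newInstance();
            // Ask the parser to deliver the text of an element as a single event
            inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLEventReader in = inputFactory.createXMLEventReader(new StringReader(xml));

            StringWriter sink = new StringWriter();
            XMLEventWriter out = XMLOutputFactory.newInstance().createXMLEventWriter(sink);

            List<XMLEvent> record = null; // non-null while we are inside a <Movie>
            boolean inDirector = false;
            boolean keep = false;

            while (in.hasNext()) {
                XMLEvent event = in.nextEvent();

                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    if (name.equals("Movie")) {
                        record = new ArrayList<>(); // start buffering a new record
                        keep = false;
                    }
                    inDirector = name.equals("Director");
                } else if (event.isCharacters() && inDirector
                        && event.asCharacters().getData().contains(director)) {
                    keep = true;
                }

                // Inside a record we buffer; outside we copy straight to the output
                if (record != null) {
                    record.add(event);
                } else {
                    out.add(event);
                }

                if (event.isEndElement()) {
                    inDirector = false;
                    if (event.asEndElement().getName().getLocalPart().equals("Movie")) {
                        if (keep) {
                            for (XMLEvent buffered : record) {
                                out.add(buffered);
                            }
                        }
                        record = null; // record finished, written or dropped
                    }
                }
            }
            out.close();
            return sink.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not filter the XML", e);
        }
    }
}
```

For files, the factories also accept a `java.io.InputStream` and a `java.io.OutputStream`, so the filtered document could go straight to disk instead of into a string.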

Here is a simple example using SAX to read data from an XML file and extract some information based on search criteria. It's a very simple example I use to teach SAX processing, and it might help you understand how SAX works. The search criterion is hardwired: the name of a movie director to search for in a giant XML file with a movie selection generated from IMDB data.

XML Source example ("source.xml" ~300MB file)

<Movies>
    ...
    <Movie>
        <Imdb>tt1527186</Imdb>
        <Title>Melancholia</Title>
        <Director>Lars von Trier</Director>
        <Year>2011</Year>
        <Duration>136</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0060390</Imdb>
        <Title>Fahrenheit 451</Title>
        <Director>François Truffaut</Director>
        <Year>1966</Year>
        <Duration>112</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0062622</Imdb>
        <Title>2001: A Space Odyssey</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1968</Year>
        <Duration>160</Duration>
    </Movie>
    ...
</Movies>

Here is an example of an event handler. It selects the Movie elements by matching strings. I extended DefaultHandler and implemented startElement() (called when an opening tag is found), characters() (called when a block of characters is read), endElement() (called when an end tag is found) and endDocument() (called once, when the document has finished). Since the data that is read is not retained in memory, you have to save the data you are interested in yourself. I used some boolean flags and instance variables to save the current tag, current data, etc.

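import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ExtractMovieSaxHandler extends DefaultHandler {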

    // These are some parameters for the search which will select 
    // the subtrees (they will receive data when we set up the parser)
    private String tagToMatch;
    private String tagContents; // text to look for inside tagToMatch
    private boolean strict = false;  // if true, the match must be exact

    /**
     * Sets criteria to select and copy Movie elements from source XML.
     *
     * @param tagToMatch Must contain text only
     * @param tagContents Text contents of the tag
     * @param strict If true, match must be exact
     */
    public void setSearchCriteria(String tagToMatch, String tagContents, boolean strict) {
        this.tagToMatch = tagToMatch;
        this.tagContents = tagContents;
        this.strict = strict;
    }

    // These are the temporary values we store as we parse the file
    private String currentElement;
    private StringBuilder contents = null; // if not null we are in Movie tag
    private String currentData;
    List<String> result = new ArrayList<String>(); // store resulting nodes here
    private boolean skip = false;

...

The following methods implement the ContentHandler callbacks. The first one is called when a start tag is found. We save the name of the tag in a variable, because it might be the one we use in the search:

...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {

        // Store the current element that started now
        currentElement = qName;

        // If this is a Movie tag, save the contents because we might need it
        if (qName.equals("Movie")) {
            contents = new StringBuilder();
        }

    }
...    

This one is called every time a block of characters is read. We check whether those characters occur inside the element we are searching. If they do, we compare them with the search text and mark the current Movie to be skipped when they don't match.

...
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {

        // if we discovered that we don't need this data, we skip it
        if (skip || currentElement == null) {
            return;
        }

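        // Keep the text of the current element. Note: a parser may deliver the
        // text of one element in several characters() calls; this simple
        // example assumes it arrives in a single call.
        currentData = new String(ch, start, length);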

        if (currentElement.equals(tagToMatch)) {
            boolean discard = true;

            if (strict) {
                if (currentData.equals(tagContents)) { // exact match
                    discard = false;
                }

            } else {
                if (currentData.toLowerCase().indexOf(tagContents.toLowerCase()) >= 0) { // matches occurrence of substring
                    discard = false;
                }
            }

            if (discard) {
                skip = true;
            }
        }

    }
...    

This one is called when an end tag is found. If the current Movie was not skipped, we append the element to the fragment we are building in memory.

...
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {

        // Rebuild the XML if it's a node we didn't skip
        if (qName.equals("Movie")) {
            if (!skip) {
                result.add(contents.insert(0, "<Movie>").append("</Movie>").toString());
            }

            // reset the variables so we can check the next node
            contents = null;
            skip = false;
        } else if (contents != null && !skip) {
            contents.append("<").append(qName).append(">")
                    .append(currentData)
                    .append("</").append(qName).append(">");
        }

        currentElement = null;
    }
...    

Finally, this one is called when the document ends. I also used it to print the result at the end.

...
    @Override
    public void endDocument() throws SAXException {
        StringBuilder resultFile = new StringBuilder();
        resultFile.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        resultFile.append("<Movies>");
        for (String childNode : result) {
            resultFile.append(childNode);
        }
        resultFile.append("</Movies>");

        System.out.println("=== Resulting XML containing Movies where " + tagToMatch + " is one of " + tagContents + " ===");
        System.out.println(resultFile.toString());
    }

}
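One caveat about the characters() callback in the handler above: SAX allows a parser to deliver the text of a single element in several chunks (this often happens around entity references such as `&amp;`), so a handler that keeps only the last chunk can miss part of the text. A common way around this is to accumulate the text and inspect it in endElement(). The sketch below shows only that pattern; the `TextAccumulatingHandler` class and its `directorsOf` helper are hypothetical names, not part of the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextAccumulatingHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    final List<String> directors = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // a new element starts: discard text collected so far
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("Director")) {
            directors.add(text.toString()); // the complete text is available here
        }
        text.setLength(0);
    }

    // Parses the given XML text and returns the contents of all Director elements.
    static List<String> directorsOf(String xml) {
        try {
            TextAccumulatingHandler handler = new TextAccumulatingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.directors;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }
}
```

With this pattern the string comparison would move from characters() to endElement(), where the whole text of the element is known.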

Here is a small Java application which loads that file, and uses an event handler to extract the data.

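import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class SAXReaderExample {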

    public static final String PATH = "src/main/resources"; // this is where I put the XML file

    public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {

        // Obtain XML Reader
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        XMLReader reader = sp.getXMLReader();

        // Instantiate SAX handler
        ExtractMovieSaxHandler handler = new ExtractMovieSaxHandler();

        // set search criteria
        handler.setSearchCriteria("Director", "Kubrick", false);

        // Register handler with XML reader
        reader.setContentHandler(handler);

        // Parse the XML (try-with-resources closes the stream afterwards)
        try (FileInputStream in = new FileInputStream(new File(PATH, "source.xml"))) {
            reader.parse(new InputSource(in));
        }
    }
}

Here is the resulting XML after processing (the handler prints it to standard output on a single line; it is indented here for readability):

<?xml version="1.0" encoding="UTF-8"?>
<Movies>
    <Movie>
        <Imdb>tt0062622</Imdb>
        <Title>2001: A Space Odyssey</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1968</Year>
        <Duration>160</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0066921</Imdb>
        <Title>A Clockwork Orange</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1972</Year>
        <Duration>136</Duration>
    </Movie>
    <Movie>
        <Imdb>tt0081505</Imdb>
        <Title>The Shining</Title>
        <Director>Stanley Kubrick</Director>
        <Year>1980</Year>
        <Duration>144</Duration>
    </Movie>
    ...
</Movies>

Your scenario might be different, but this example shows a general solution which you can probably adapt to your problem. You can find more information in tutorials about SAX and JAXP.
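One thing to watch when adapting the handler above: it rebuilds the output by concatenating strings, so text containing `&` or `<` would be written unescaped and the result would not be well-formed. A possible alternative is to let `XMLStreamWriter` (part of the StAX API) produce the output, since it escapes character data for you. The sketch below is only an illustration; the `MovieWriterSketch` class and its methods are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class MovieWriterSketch {

    // Writes a one-movie document and returns it as a string.
    static String write(String title, String director) {
        try {
            StringWriter sink = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);

            writer.writeStartDocument();
            writer.writeStartElement("Movies");
            writer.writeStartElement("Movie");
            writeField(writer, "Title", title);
            writer.writeEndElement(); // </Movie> is closed after the last field below
            writer.writeEndElement(); // </Movies>
            writer.writeEndDocument();
            writer.close();
            return sink.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write the XML", e);
        }
    }

    // Writes <name>value</name>; writeCharacters() escapes &, < and > in the value.
    private static void writeField(XMLStreamWriter writer, String name, String value)
            throws XMLStreamException {
        writer.writeStartElement(name);
        writer.writeCharacters(value);
        writer.writeEndElement();
    }
}
```

In this sketch the `director` parameter is accepted but only the title is written, to keep the example short; further fields would be added with more `writeField` calls before the Movie element is closed. For a file, `createXMLStreamWriter` also accepts a `java.io.OutputStream` together with an encoding name.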

Upvotes: 2
