Reputation: 1447
I am writing a RESTful web service in Java. The idea is to "cut down" an XML document and strip away all the unneeded content (~98%) and leave only the tags we're interested in, while maintaining the document's structure, which is as follows (I cannot provide the actual XML content for confidentiality reasons):
<sear:SEGMENTS xmlns="http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib" xmlns:sear="http://www.exlibrisgroup.com/xsd/jaguar/search">
  <sear:JAGROOT>
    <sear:RESULT>
      <sear:DOCSET IS_LOCAL="true" TOTAL_TIME="176" LASTHIT="9" FIRSTHIT="0" TOTALHITS="262" HIT_TIME="11">
        <sear:DOC SEARCH_ENGINE_TYPE="Local Search Engine" SEARCH_ENGINE="Local Search Engine" NO="1" RANK="0.086826384" ID="2347460">
          [
          <PrimoNMBib>
            <record>
              <display>
                <title></title>
              </display>
              <sort>
                <author></author>
              </sort>
            </record>
          </PrimoNMBib>
          ]
        </sear:DOC>
      </sear:DOCSET>
    </sear:RESULT>
  </sear:JAGROOT>
</sear:SEGMENTS>
Of course, this is the structure of only the tags we are interested in - there are hundreds more tags, but they are irrelevant.
The square brackets ([]) are not part of the XML. They indicate that <PrimoNMBib></PrimoNMBib> is a repeated child element that occurs more than once: one per match of the search from the RESTful service.
I've been trying to parse the document with regular expressions so as to leave only the segments of the structure shown above, along with the values of <title> and <author>, while removing everything else in between the tags, including other tags. However, I can't get it to work for the life of me...
Previously I tried it using XSLT, but for unresolved reasons that didn't work either... I'd already asked a question about the XSLT implementation...
Anyway, I would very much appreciate a tip/hint/solution on how to solve this problem using regex and Java...
Upvotes: 0
Views: 292
Reputation: 149017
I wouldn't recommend using regex to manipulate XML.
Alternative Approach
You could use a StAX parser that leverages a StreamFilter to cut down the document and still maintain a valid structure.
How a StreamFilter Works
A StreamFilter receives each event from the XMLStreamReader. If you want the event reported you return true, otherwise false. In the example below the StreamFilter will reject elements in the "http://www.exlibrisgroup.com/xsd/jaguar/search" namespace, along with any text events that follow them until the next element. You will need to tweak the logic to get it to match the requirements of your use case.
Demo
package forum10351473;

import java.io.FileReader;
import javax.xml.stream.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        XMLInputFactory xif = XMLInputFactory.newFactory();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileReader("src/forum10351473/input.xml"));
        xsr = xif.createFilteredReader(xsr, new StreamFilter() {

            private boolean reportContent = false;

            @Override
            public boolean accept(XMLStreamReader reader) {
                if(reader.isStartElement() || reader.isEndElement()) {
                    reportContent = !"http://www.exlibrisgroup.com/xsd/jaguar/search".equals(reader.getNamespaceURI());
                }
                return reportContent;
            }

        });

        // The XMLStreamReader (xsr) will now only report the events you care about.
        // You can process the XMLStreamReader yourself or pass as input to something
        // like JAXB.
        while(xsr.hasNext()) {
            if(xsr.isStartElement()) {
                System.out.println(xsr.getLocalName());
            }
            xsr.next();
        }
    }

}
Output
PrimoNMBib
record
display
title
sort
author
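Tweaked for the Question's Layout (Sketch)
As a rough sketch of how the logic might be tweaked to keep the sear: wrapper elements plus only display/title and sort/author, the variation below processes the events itself instead of using a StreamFilter. It reads with an XMLEventReader, tracks the path of open elements, and copies the kept events to an XMLEventWriter. The class name PrunedCopy, the prune method, and the path constants are hypothetical. They assume the structure shown in the question, match on local names only (namespaces are ignored when matching), and assume that both namespaces are declared on the root element as in the sample.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the Demo above.
final class PrunedCopy {

    // Assumed layout, taken from the structure shown in the question.
    private static final String RECORD =
            "/SEGMENTS/JAGROOT/RESULT/DOCSET/DOC/PrimoNMBib/record";

    // Elements whose text content is copied.
    private static final Set<String> TEXT_PATHS = new HashSet<>(Arrays.asList(
            RECORD + "/display/title",
            RECORD + "/sort/author"));

    // Elements whose tags are copied: the text elements plus all their ancestors.
    private static final Set<String> KEEP_PATHS = withAncestors(TEXT_PATHS);

    private static Set<String> withAncestors(Set<String> leaves) {
        Set<String> all = new HashSet<>();
        for (String leaf : leaves) {
            // Add every proper prefix of the path that ends just before a '/'.
            for (int i = leaf.indexOf('/', 1); i != -1; i = leaf.indexOf('/', i + 1)) {
                all.add(leaf.substring(0, i));
            }
            all.add(leaf);
        }
        return all;
    }

    static String prune(String xml) {
        Deque<String> path = new ArrayDeque<>(); // paths of the open, kept elements
        int skipDepth = 0;                       // > 0 while inside a dropped subtree
        StringWriter out = new StringWriter();
        try {
            XMLEventReader reader = XMLInputFactory.newFactory()
                    .createXMLEventReader(new StringReader(xml));
            XMLEventWriter writer = XMLOutputFactory.newFactory()
                    .createXMLEventWriter(out);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String parent = path.isEmpty() ? "" : path.peek();
                    String current = parent + "/"
                            + event.asStartElement().getName().getLocalPart();
                    if (skipDepth > 0 || !KEEP_PATHS.contains(current)) {
                        skipDepth++;        // drop this element and its whole subtree
                    } else {
                        path.push(current);
                        writer.add(event);  // attributes and namespace declarations come along
                    }
                } else if (event.isEndElement()) {
                    if (skipDepth > 0) {
                        skipDepth--;
                    } else {
                        path.pop();
                        writer.add(event);
                    }
                } else if (event.isCharacters() && skipDepth == 0
                        && !path.isEmpty() && TEXT_PATHS.contains(path.peek())) {
                    writer.add(event);      // text directly inside title/author only
                }
                // All other events (whitespace, comments, document start/end) are dropped.
            }
            writer.close();
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not prune XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data that follows the question's structure.
        String xml = "<sear:SEGMENTS xmlns=\"http://www.exlibrisgroup.com/xsd/primo/primo_nm_bib\""
                + " xmlns:sear=\"http://www.exlibrisgroup.com/xsd/jaguar/search\">"
                + "<sear:JAGROOT><sear:RESULT><sear:DOCSET><sear:DOC><PrimoNMBib><record>"
                + "<display><title>A title</title><type>book</type></display>"
                + "<sort><author>An author</author></sort>"
                + "</record></PrimoNMBib></sear:DOC></sear:DOCSET></sear:RESULT>"
                + "</sear:JAGROOT></sear:SEGMENTS>";
        System.out.println(prune(xml));
    }
}
```

Matching on the full path rather than on the element name alone means a <title> that lives under some other parent is dropped together with that parent. For a large response you would probably read from and write to streams rather than Strings.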
Upvotes: 1