Johannes Ernst

Reputation: 3186

Parsing XML file containing HTML entities in Java without changing the XML

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.

Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.

I'd like to use:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

DocumentBuilder parser = dbf.newDocumentBuilder();
Document        doc    = parser.parse( stream );

I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

Here's a full example:

public class Main {
    public static void main( String [] args ) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder parser = dbf.newDocumentBuilder();
        Document        doc    = parser.parse( new FileInputStream( "test.xml" ));
    }

}

with test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

Produces:

[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.

Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?

The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

Upvotes: 22

Views: 15926

Answers (6)

V_Dev

Reputation: 93

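Try this, using Apache Commons IO and Commons Lang 3 (the org.apache.commons packages):

import java.nio.charset.StandardCharsets;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.text.translate.AggregateTranslator;
import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
import org.apache.commons.lang3.text.translate.EntityArrays;
import org.apache.commons.lang3.text.translate.LookupTranslator;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();

// Read the whole file into a string (UTF-8 assumed, as declared in test.xml)
InputStream in = new FileInputStream(xmlfile);
String xml = IOUtils.toString(in, StandardCharsets.UTF_8);

// Translate HTML-only entities to characters; the XML built-ins (&amp;, &lt;, ...) are left alone
CharSequenceTranslator obj = new AggregateTranslator(
        new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
        new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE()));

xml = obj.translate(xml);
StringReader readerInput = new StringReader(xml);

InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);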

Upvotes: 1

Marek Derdzinski

Reputation: 45

I did something similar yesterday: I needed to add a value from an unzipped XML stream to a database.

// Imports (I'm not sure all of them are necessary)
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// I haven't tested this code just now because I'm at work; it should work,
// but you may need to make small changes.
InputSource is = new InputSource(new FileInputStream("test.xml"));

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);

// Library used: commons-lang3.jar
// Method that decodes HTML entities in a string
public static String parseToChar(String words) {
    return org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
}

Upvotes: 1

SkyWalker

Reputation: 29150

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;

XML has only five predefined entities (&amp;, &lt;, &gt;, &quot; and &apos;). &mdash; and &nbsp; are not among them; they work only in plain HTML or in legacy JSP. So SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link)
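
To make the distinction concrete, here is a minimal sketch using the standard DOM parser. The class and method names are made up for illustration. A string that uses only the five predefined entities should parse, while one that uses &nbsp; should be rejected with the same "referenced, but not declared" error shown in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class PredefinedEntityCheck {

    // Parses the XML string with the standard DOM parser and returns the root element's text.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    // Returns true if the parser accepts the string, false if it rejects it.
    static boolean parses(String xml) throws Exception {
        try {
            rootText(xml);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Only predefined entities: expected to parse.
        System.out.println(rootText("<a>&lt;&gt;&amp;&apos;&quot;</a>"));
        // HTML-only entity without a declaration: expected to be rejected.
        System.out.println(parses("<a>&nbsp;</a>"));
    }
}
```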

Issue - 2: I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?

The Streaming API for XML, called StAX, is an API for reading and writing XML documents.

StAX is a pull-parsing model: the application takes control of parsing the XML document by pulling (taking) events from the parser.

The core StAX API falls into two categories:

  • Cursor-based API: the low-level API. It allows the application to process XML as a stream of tokens, also known as events.

  • Iterator-based API: the higher-level API. It allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.

The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:

Requires the parser to replace internal entity references with their replacement text and report them as characters

This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or XMLStreamReader.

However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to not replace them.

You may try it. Hope it will solve your issue. For your case,

Main.java

import java.io.FileInputStream;
import java.io.FileNotFoundException;

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;

public class Main {

    public static void main(String[] args) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty(
                XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLEventReader reader;
        try {
            reader = inputFactory
                    .createXMLEventReader(new FileInputStream("F://test.xml"));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isEntityReference()) {
                    EntityReference ref = (EntityReference) event;
                    System.out.println("Entity Reference: " + ref.getName());
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (XMLStreamException e) {
            e.printStackTrace();
        }
    }
}

test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>

Output:

Entity Reference: nbsp

Entity Reference: mdash

Credit goes to @skaffman.

Related Link:

  1. http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
  2. http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
  3. http://www.vogella.com/tutorials/JavaXML/article.html
  4. Is there a Java XML API that can parse a document without resolving character entities?

UPDATE:

Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them with something else, for example) and still produce a Document at the end of the process?

To create a new document using the StAX API, it is required to create an XMLStreamWriter that provides methods to produce XML opening and closing tags, attributes and character content.

Five XMLStreamWriter methods cover the basics of building a document:

  1. xmlsw.writeStartDocument() - initialises an empty document to which elements can be added
  2. xmlsw.writeStartElement(String s) - creates a new element named s
  3. xmlsw.writeAttribute(String name, String value) - adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. It is possible to add attributes as long as no call to writeStartElement, writeCharacters or writeEndElement has been made since.
  4. xmlsw.writeEndElement() - closes the last started element
  5. xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element.
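
As a small illustration of these five calls, the following sketch writes a tiny document into a string. The class name, the "lang" attribute and the element content are assumptions made up for this example; only the XMLStreamWriter calls themselves come from the list above.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical class, for illustration only.
final class WriterSketch {

    // Builds <foo><bar lang="en">Some text</bar></foo> and returns it as a string.
    static String buildFoo() throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xmlsw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        xmlsw.writeStartDocument();           // XML declaration
        xmlsw.writeStartElement("foo");       // <foo>
        xmlsw.writeStartElement("bar");       // <bar ...
        xmlsw.writeAttribute("lang", "en");   // attribute on the element just started
        xmlsw.writeCharacters("Some text");   // text content of <bar>
        xmlsw.writeEndElement();              // </bar>
        xmlsw.writeEndElement();              // </foo>
        xmlsw.writeEndDocument();

        xmlsw.flush();
        xmlsw.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildFoo());
    }
}
```

Note that writeAttribute is called directly after writeStartElement, before any character content, which is the ordering constraint described in item 3.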

A complete example follows:

StAXExpand.java

import  java.io.BufferedReader;
import  java.io.FileReader;
import  java.io.IOException;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

import java.util.Arrays;

public class StAXExpand {   
    static XMLStreamWriter xmlsw = null;
    public static void main(String[] argv) {
        try {
            xmlsw = XMLOutputFactory.newInstance()
                          .createXMLStreamWriter(System.out);
            CompactTokenizer tok = new CompactTokenizer(
                          new FileReader(argv[0]));

            String rootName = "dummyRoot";
            // ignore everything preceding the word before the first "["
            while(!tok.nextToken().equals("[")){
                rootName=tok.getToken();
            }
            // start creating new document
            xmlsw.writeStartDocument();
            ignorableSpacing(0);
            xmlsw.writeStartElement(rootName);
            expand(tok,3);
            ignorableSpacing(0);
            xmlsw.writeEndDocument();

            xmlsw.flush();
            xmlsw.close();
        } catch (XMLStreamException e){
            System.out.println(e.getMessage());
        } catch (IOException ex) {
            System.out.println("IOException"+ex);
            ex.printStackTrace();
        }
    }

    public static void expand(CompactTokenizer tok, int indent) 
        throws IOException,XMLStreamException {
        tok.skip("["); 
        while(tok.getToken().equals("@")) {// add attributes
            String attName = tok.nextToken();
            tok.nextToken();
            xmlsw.writeAttribute(attName,tok.skip("["));
            tok.nextToken();
            tok.skip("]");
        }
        boolean lastWasElement=true; // for controlling the output of newlines 
        while(!tok.getToken().equals("]")){ // process content 
            String s = tok.getToken().trim();
            tok.nextToken();
            if(tok.getToken().equals("[")){
                if(lastWasElement)ignorableSpacing(indent);
                xmlsw.writeStartElement(s);
                expand(tok,indent+3);
                lastWasElement=true;
            } else {
                xmlsw.writeCharacters(s);
                lastWasElement=false;
            }
        }
        tok.skip("]");
        if(lastWasElement)ignorableSpacing(indent-3);
        xmlsw.writeEndElement();
   }

    private static char[] blanks = "\n".toCharArray();
    private static void ignorableSpacing(int nb) 
        throws XMLStreamException {
        if(nb>blanks.length){// extend the length of space array 
            blanks = new char[nb+1];
            blanks[0]='\n';
            Arrays.fill(blanks,1,blanks.length,' ');
        }
        xmlsw.writeCharacters(blanks, 0, nb+1);
    }

}

CompactTokenizer.java

import  java.io.Reader;
import  java.io.IOException;
import  java.io.StreamTokenizer;

public class CompactTokenizer {
    private StreamTokenizer st;

    CompactTokenizer(Reader r){
        st = new StreamTokenizer(r);
        st.resetSyntax(); // remove parsing of numbers...
        st.wordChars('\u0000','\u00FF'); // everything is part of a word
                                         // except the following...
        st.ordinaryChar('\n');
        st.ordinaryChar('[');
        st.ordinaryChar(']');
        st.ordinaryChar('@');
    }

    public String nextToken() throws IOException{
        st.nextToken();
        while(st.ttype=='\n'|| 
              (st.ttype==StreamTokenizer.TT_WORD && 
               st.sval.trim().length()==0))
            st.nextToken();
        return getToken();
    }

    public String getToken(){
        return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
    }

    public String skip(String sym) throws IOException {
        if(getToken().equals(sym))
            return nextToken();
        else
            throw new IllegalArgumentException("skip: "+sym+" expected but "+
                                               getToken()+" found ");
    }
}
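
The example above writes a new document but does not deal with entity references. For Issue 3 specifically, a minimal sketch could read the input with a cursor-based XMLStreamReader, with IS_REPLACING_ENTITY_REFERENCES set to false, and build the DOM Document by hand, substituting text for each entity reference. Everything here is an assumption made for illustration: the class name, the two-entry replacement table, and the choice to ignore namespaces, comments and processing instructions. It also relies on the StAX implementation reporting undeclared entities as ENTITY_REFERENCE events, as the output earlier in this answer suggests; the API does not guarantee that, so other implementations may behave differently.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical class, for illustration only.
final class EntityFilteringDomBuilder {

    // Assumed, deliberately tiny table; a real one would cover all HTML entities.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("nbsp", "\u00A0");
        REPLACEMENTS.put("mdash", "\u2014");
    }

    static Document parse(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to report entity references instead of expanding them.
        factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Node current = doc; // node that receives the next child

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    // Simplification: namespaces and prefixes are ignored.
                    Element element = doc.createElement(reader.getLocalName());
                    for (int i = 0; i < reader.getAttributeCount(); i++) {
                        element.setAttribute(reader.getAttributeLocalName(i),
                                             reader.getAttributeValue(i));
                    }
                    current.appendChild(element);
                    current = element;
                    break;
                }
                case XMLStreamConstants.END_ELEMENT:
                    current = current.getParentNode();
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    appendText(doc, current, reader.getText());
                    break;
                case XMLStreamConstants.ENTITY_REFERENCE: {
                    // For this event type, getLocalName() returns the entity name.
                    String name = reader.getLocalName();
                    // Unknown names are kept as literal "&name;" text.
                    appendText(doc, current, REPLACEMENTS.getOrDefault(name, "&" + name + ";"));
                    break;
                }
                default:
                    break; // comments, processing instructions etc. are dropped
            }
        }
        reader.close();
        doc.normalize(); // merge adjacent text nodes
        return doc;
    }

    // Text outside the root element (e.g. whitespace in the prolog) is skipped.
    private static void appendText(Document doc, Node parent, String text) {
        if (parent instanceof Element) {
            parent.appendChild(doc.createTextNode(text));
        }
    }
}
```

Calling EntityFilteringDomBuilder.parse(...) on the contents of test.xml should yield a Document whose bar element contains the text with a real no-break space and a real dash in place of the two entities.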

For more, you can follow the tutorial

  1. https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
  2. http://www.ibm.com/developerworks/library/x-tipstx2/index.html
  3. http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
  4. http://staf.sourceforge.net/current/STAXDoc.pdf

Upvotes: 9

applecrusher

Reputation: 5648

I would use a library like Jsoup for this purpose. I tested the code below and it works. Jsoup can be downloaded here: http://jsoup.org/download

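import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public static void main(String[] args) {
    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
                  "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
    // Parse with Jsoup's lenient XML parser; the base URI is left empty
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element e : doc.select("bar")) {
        System.out.println(e);
    }
}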

Result:

<bar>
 Some&nbsp;text — invalid!
</bar>

Loading from a file can be found here:

http://jsoup.org/cookbook/input/load-document-from-file

Upvotes: 11

Richard

Reputation: 1130

Another approach, since you're not using a rigid OXM approach anyway: you might want to try a less rigid parser such as Jsoup. This will avoid the immediate problems with invalid XML, schemas etc., but it just moves the problem into your own code.

Upvotes: 3

rpy

Reputation: 4013

Just to throw in a different approach to a solution:

You might wrap your input stream in a stream implementation that replaces the entities with something legal.

While this is a hack for sure, it should be a quick and easy solution (or rather, a workaround).
It is not as elegant and clean as a solution inside the XML framework, though.
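
A minimal sketch of that idea follows. For brevity it works on a String rather than wrapping the stream, and it rewrites known HTML entity names into numeric character references, which any XML parser accepts. The class name and the two-entry lookup table are assumptions for illustration only. Note that a plain regex pass like this would also rewrite matches inside CDATA sections and comments.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class, for illustration only.
final class EntityPreprocessor {

    // Assumed, deliberately tiny table of HTML entity names and their code points.
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("nbsp", 0xA0);
        CODE_POINTS.put("mdash", 0x2014);
    }

    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known named entities with numeric references, e.g. &nbsp; becomes &#160;.
    // Names not in the table (including the XML built-ins such as &amp;) are left untouched.
    static String toNumericReferences(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Pre-processes the text, then hands it to the ordinary DOM parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(toNumericReferences(xml))));
    }
}
```

The same replacement could be moved into a FilterReader subclass if the files are too large to hold in memory.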

Upvotes: 1
