Abraham Simpson

Reputation: 350

SAXParseException working with GraphMLReader of Prefuse

I'm writing a Java program that uses the prefuse library. The program generates graphs from information collected from Twitter. I want the program to save the generated graphs so that I can load them later.

The prefuse class GraphMLWriter works fine: it generates a GraphML file encoded in UTF-8 with XML version 1.0.

My problem appears when I want to load the generated GraphML file. To do that I use the method readGraph(InputStream is) of the class GraphMLReader. This method returns a Graph object and uses a SAX parser to parse the GraphML file with a handler object of the class GraphMLHandler. The handler constructs the graph as the parser works through the lines of the XML file. I'm getting a SAXParseException, wrapped in a prefuse.data.io.DataIOException, when the XML file contains characters like 'á' or 'ñ' or emoticons. All the generated XML files contain Strings that represent tweets.

An example is:

<data key="info">Las extra&#241;o muchooooo a ambas! &#55357;&#56469;</data>

The error says:

Exception in thread "main" prefuse.data.io.DataIOException: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 67; The character reference "&#

and nothing else; the error message appears to be cut off.

This is the code that I use to save the Graph object 'graph' into a GraphML file called "Graph saved":

(new GraphMLWriter()).writeGraph(graph, "Graph saved"); 

And this is the code that I use to load the GraphML file called "Graph saved" into a Graph object 'g2':

Graph g2 = (new GraphMLReader().readGraph("Graph saved")); 

What can I do to resolve this problem?

Upvotes: 0

Views: 214

Answers (1)

James Fry

Reputation: 1163

&#55357; and &#56469; are the two halves of a surrogate pair, so I'm guessing your original data contains characters outside the Basic Multilingual Plane, such as emoji. It appears that the prefuse GraphMLWriter creates an XMLWriter that makes some assumptions about encodings that aren't necessarily correct: it assumes that every char in a String is a complete 16-bit code point and escapes each one separately. A surrogate half on its own is not a legal XML character, so the parser rejects a numeric reference to it. In this case we have a surrogate pair and some smarter handling is required (to be fair to the original author, seeing such values in the wild in 2005/2006 was somewhat unusual, and pretty much everyone assumed that Unicode meant 16 bits per character).
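To check this yourself, here is a minimal sketch that uses only the JDK. The class name SurrogateCheck and its two helper methods are made up for illustration and are not part of prefuse. It confirms that the two numbers from your example form a surrogate pair, combines them into a single code point, and then asks the JDK's own SAX parser whether it accepts each way of writing the character.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of prefuse.
public class SurrogateCheck {

    // Combine a high and a low surrogate into one Unicode code point.
    public static int combine(int high, int low) {
        char hi = (char) high;
        char lo = (char) low;
        if (!Character.isSurrogatePair(hi, lo)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint(hi, lo);
    }

    // Returns true if the JDK's SAX parser accepts the given XML text,
    // false if it reports a parse error.
    public static boolean parses(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        int cp = combine(55357, 56469);
        System.out.println(cp);                                       // 128149 (U+1F495)
        System.out.println(parses("<data>&#55357;&#56469;</data>")); // false
        System.out.println(parses("<data>&#" + cp + ";</data>"));    // true
    }
}
```

So the character in your tweet would need to be written as the single reference &#128149; rather than as two references to its halves.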

Regardless, I think the only options you have here are to pre-filter your data or to patch the prefuse library. If you are averse to forking, one approach would be to extend GraphMLWriter and override writeGraph with an almost exact copy, replacing the creation of XMLWriter on line 73 with the creation of your own extended XMLWriter in which you override escapeString to deal with the surrogates properly. Java's Character class provides methods that tell you whether a char is a surrogate and whether a pair of chars makes a valid surrogate pair; if you find such a pair, you can generate the correct XML character reference for the combined code point.
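As a rough sketch of the escaping logic, something along these lines should work. SurrogateAwareEscaper and its escape method are hypothetical names of my own, not prefuse API, and I have not checked how prefuse's escapeString is declared, so treat this as a standalone helper you would adapt inside your own XMLWriter subclass (or call on your strings beforehand if you go the pre-filter route). It walks the String by code point instead of by char, writes one numeric reference per code point, and drops anything XML 1.0 does not allow, including unpaired surrogates.

```java
// Hypothetical helper; not part of prefuse.
public class SurrogateAwareEscaper {

    // Escape a String for use as XML 1.0 text or attribute content.
    public static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as its own value.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);

            boolean loneSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            boolean badControl = cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r';
            if (loneSurrogate || badControl || cp == 0xFFFE || cp == 0xFFFF) {
                continue; // not a legal XML 1.0 character, so drop it
            }

            switch (cp) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (cp > 0x7E) {
                        // One decimal reference for the whole code point.
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.append((char) cp);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "extraño" followed by U+1F495, written with escapes to stay ASCII-safe.
        System.out.println(escape("extra\u00F1o \uD83D\uDC95")); // extra&#241;o &#128149;
    }
}
```

Dropping illegal characters silently is a design choice; you may prefer to replace them with U+FFFD or to throw, depending on how much you care about losing data.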

Upvotes: 0
