Amish Programmer

Reputation: 2121

Using libxml2 to Parse XML Attributes Containing Invalid Characters

I am attempting to parse XML response messages from a third-party interface that contain illegal characters. Please note that these responses are not within my control.

The following is a modified example response:

<?xml version="1.0"?>
<response>
  <data value="Example A" />
  <data value="Example B" />
  <data value="Example C" />
</response>

Occasionally, the "value" attribute might contain the ESC control character [0x1b], which is used (questionably) to indicate special characteristics to be applied to the value.

<?xml version="1.0"?>
<response>
  <data value="[0x1b]Example A" />
  <data value="Example B" />
</response>

I'm using the libxml2 xmlParseMemory() function to attempt to parse this response. http://www.xmlsoft.org/html/libxml-parser.html#xmlParseMemory

I'm calling the function as follows:

xmlDocPtr doc = xmlParseMemory( buffer, size );

When the response XML is valid, I get a valid xmlDocPtr and can continue to work with it. If the response contains illegal characters, I receive NULL and wind up at a dead end.

Is there any way I can parse these messages without receiving errors and without throwing away the illegal characters?

Upvotes: 0

Views: 946

Answers (1)

abligh

Reputation: 25119

You are asking the unanswerable. Suppose that instead of a 0x1B character you got a \n? Or, worse, an additional "? Or a \? Anything that produces invalid XML is going to make libxml2 choke, because it is an XML parser, and the example you gave is invalid XML. If you want to parse invalid XML, you need to decide how it should be parsed, and then either modify libxml2 or modify the XML so that it is valid and undo the damage later. It is invalid XML precisely because it is not obvious how such things should parse.
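To illustrate the "modify the XML so it is valid and undo the damage later" route, here is a minimal sketch, not a definitive implementation. It assumes the response is UTF-8 and that the private-use code points U+E001 to U+E01F never occur in real data. The helper names escape_ctrl() and unescape_ctrl() are hypothetical, not part of libxml2. Each forbidden control byte b is replaced by the UTF-8 encoding of U+E000+b, which is a legal XML character, and mapped back after parsing.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* XML 1.0 forbids C0 control bytes other than TAB, LF and CR. */
static int is_illegal_ctrl(unsigned char c)
{
    return c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D;
}

/* Hypothetical helper: returns a malloc'd, NUL-terminated copy of
 * in[0..len) in which every forbidden control byte b is replaced by
 * the three UTF-8 bytes of U+E000+b (EE 80 80|b). The new length is
 * stored in *out_len. Returns NULL if allocation fails. */
char *escape_ctrl(const char *in, size_t len, size_t *out_len)
{
    assert(in != NULL && out_len != NULL);
    char *out = malloc(len * 3 + 1);   /* worst case: every byte triples */
    if (out == NULL)
        return NULL;
    size_t j = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)in[i];
        if (is_illegal_ctrl(c)) {
            out[j++] = (char)0xEE;
            out[j++] = (char)0x80;
            out[j++] = (char)(0x80 | c);
        } else {
            out[j++] = (char)c;
        }
    }
    out[j] = '\0';
    *out_len = j;
    return out;
}

/* Hypothetical helper: reverses escape_ctrl() in place on a
 * NUL-terminated string, such as an attribute value read back from
 * the parsed document. An escaped NUL byte (U+E000) is left alone,
 * since restoring it would truncate the string. */
void unescape_ctrl(char *s)
{
    assert(s != NULL);
    unsigned char *r = (unsigned char *)s;
    unsigned char *w = (unsigned char *)s;
    while (*r != 0) {
        /* Short-circuit evaluation keeps the reads inside the string. */
        if (r[0] == 0xEE && r[1] == 0x80 && (r[2] & 0xE0) == 0x80 &&
            (r[2] & 0x1F) != 0 && is_illegal_ctrl(r[2] & 0x1F)) {
            *w++ = (unsigned char)(r[2] & 0x1F);
            r += 3;
        } else {
            *w++ = *r++;
        }
    }
    *w = 0;
}
```

Under those assumptions, you would pass the escaped buffer and its length to xmlParseMemory() instead of the raw response, and call unescape_ctrl() on each value returned by xmlGetProp() before using it. Whether a placeholder like this is safe depends on what the third-party interface can actually emit, so treat the mapping as something to adapt.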

The best solution is to fix whatever is producing the (alleged) xml to not produce broken xml.

Upvotes: 1
