Bryan P

Reputation: 6220

Parsing XML escape sequences with MATLAB's xmlread()

MATLAB's xmlread() function (R2013a) fails when it encounters named character entities such as &ntilde; (ñ). Does anyone have a clean workaround?

For example, when parsing an XML file that contains the following snippet:

<Cell><Data ss:Type="String">Perdidas - A&ntilde;o0 (euros)</Data></Cell>

xmlread() fails with the following error:

[Fatal Error] resultados.xml:236:50: The entity "ntilde" was referenced, but not declared.

Upvotes: 1

Views: 719

Answers (1)

horchler

Reputation: 18484

Matlab's tools for dealing with DTDs are incomplete. Notably, if you use xmlread to read in an XML file that includes a DTD and then use xmlwrite to save it back out, all of the DTD content is stripped (entity substitutions are performed on read, so the new file can still be parsed without errors). There's no simple and truly robust way of just inserting a DTD. This is XML, where everything is strict, and one ought to be very careful when reading from and writing to the files.
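The round trip described above can be checked with a few lines. This is only a sketch: the file names 'withdtd.xml' and 'roundtrip.xml' and the tiny <Note> document are made up for illustration, and the exact serialized output may vary between Matlab releases.

```matlab
% Write a small XML file that declares one entity in an internal DTD.
% (File names and the <Note> element are hypothetical.)
src = sprintf(['<?xml version="1.0" encoding="utf-8"?>\n' ...
               '<!DOCTYPE Note [ <!ENTITY ntilde "&#241;"> ]>\n' ...
               '<Note>A&ntilde;o</Note>\n']);
fid = fopen('withdtd.xml', 'w', 'n', 'UTF-8');
fwrite(fid, src, 'char');
fclose(fid);

domNode = xmlread('withdtd.xml');     % entity references are expanded while parsing
xmlwrite('roundtrip.xml', domNode);   % serialize the DOM back to disk

type('withdtd.xml')                   % original file, with the DOCTYPE block
type('roundtrip.xml')                 % compare: the DTD content should be missing here
```

Comparing the two listings shows whether the DOCTYPE block survived; the text of <Note> should already contain the substituted character.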

However, using some old code, I've hacked together a non-robust solution that may work in simple cases as long as you check the output. You can download the M-file and an example XML file here. The xmlentity function adds DTD entities to an XML file by reading in the contents, performing some crude parsing, and writing out the new data.

I used the following 'example.xml' XML file (from here), edited to include some HTML entities:

<?xml version="1.0" encoding="utf-8"?>
<AddressBook>
   <Entry>
      <Name>Frie&ntilde;dly J.&nbsp;Mathworker</Name>
      <PhoneNumber>(508) 647-7000</PhoneNumber>
      <Address hasZip="no" type="work">3 Apple Hill Dr, Natick MA</Address>
   </Entry>
</AddressBook>

Calling xmlread('example.xml') on this file returns an error like the one you're seeing because the entities are never declared: XML predefines only &lt;, &gt;, &amp;, &apos;, and &quot;, so the file is not well-formed as it stands. To fix this, define the two entities that are used (lists of others can be found here) and call my xmlentity function:

entities = {'nbsp','&#160;';
            'ntilde','&#241;'};
domNode = xmlentity(entities,'example.xml','example2.xml')

This produces the following XML file in 'example2.xml':

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE AddressBook
[
   <!ENTITY nbsp "&#160;">
   <!ENTITY ntilde "&#241;">
]>
<AddressBook>
   <Entry>
      <Name>Frie&ntilde;dly J.&nbsp;Mathworker</Name>
      <PhoneNumber>(508) 647-7000</PhoneNumber>
      <Address hasZip="no" type="work">3 Apple Hill Dr, Natick MA</Address>
   </Entry>
</AddressBook>

Additionally,

domNode.getElementsByTagName('Name').item(0).getTextContent

returns 'Frieñdly J. Mathworker' (the space after "J." is the non-breaking space, character 160). See the help in xmlentity for further details and caveats.
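For readers without the M-file, the general idea (read the text, insert an internal DTD ahead of the root element, write it back out) might be sketched roughly as follows. This is not the xmlentity implementation. It assumes the file has no existing DOCTYPE, that the root element name is known in advance (the rootName variable is introduced here for illustration), and that the file is plain ASCII apart from the entity references.

```matlab
% Rough sketch only; NOT the actual xmlentity code.
entities = {'nbsp',   '&#160;';
            'ntilde', '&#241;'};
rootName = 'AddressBook';             % assumed: must match the root element

txt = fileread('example.xml');        % raw file contents as a char vector

% Build one <!ENTITY ...> declaration per row of the cell array.
decl = '';
for k = 1:size(entities, 1)
    decl = [decl sprintf('   <!ENTITY %s "%s">\n', ...
                         entities{k, 1}, entities{k, 2})]; %#ok<AGROW>
end
doctype = sprintf('<!DOCTYPE %s\n[\n%s]>\n', rootName, decl);

% Find the root element's opening tag and insert the DOCTYPE just before it.
idx = regexp(txt, ['<' rootName '[\s>]'], 'once');
if isempty(idx)
    error('Root element <%s> not found.', rootName);
end
txt = [txt(1:idx-1) doctype txt(idx:end)];

% Write the modified text as UTF-8 and parse it.
fid = fopen('example2.xml', 'w', 'n', 'UTF-8');
fwrite(fid, txt, 'char');
fclose(fid);

domNode = xmlread('example2.xml');
```

The parsing here is as crude as it looks, so check the output file by hand before relying on it.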

There are many other ways to deal with this and my code could probably be adapted to use some of them. External DTDs are convenient as they allow you to use one file to declare all of your entities and then you just need to indicate the URI of this file in a simple DTD (and set the XML file to not be standalone). XSLT/Schema is another option. It's much more complicated, but has many more features. Matlab has better support for it too, but it still takes work.
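One cruder alternative, which sidesteps DTDs entirely and is not one of the options named above, is to replace each named entity with its numeric character reference before parsing. Numeric references need no declaration. A minimal sketch, assuming the same 'example.xml' and entity list as before, and assuming the entity names never appear inside CDATA sections or comments where they should be left alone:

```matlab
% Sketch of a text-substitution workaround (a different technique from xmlentity).
entities = {'nbsp',   '&#160;';
            'ntilde', '&#241;'};

txt = fileread('example.xml');

% Replace every &name; with the corresponding numeric reference.
for k = 1:size(entities, 1)
    txt = strrep(txt, ['&' entities{k, 1} ';'], entities{k, 2});
end

% xmlread takes a file name, so write the result to a temporary file first.
tmpFile = [tempname '.xml'];
fid = fopen(tmpFile, 'w', 'n', 'UTF-8');
fwrite(fid, txt, 'char');
fclose(fid);

domNode = xmlread(tmpFile);
delete(tmpFile);                      % clean up the temporary copy
```

The original file is left untouched, but the named entities are lost from the parsed document, which may matter if the file has to be written back out in its original form.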

Upvotes: 1
