Nico Schlömer

Reputation: 58901

read XML with binary content

I would like to use Python to read a VTU file, which is XML but may contain raw binary data. The specification says:

There is one case in which the file is not a valid XML document. When the AppendedData section is not encoded as base64, raw binary data is present that may violate the XML specification. This is not default behavior, and must be explicitly enabled by the user.

For example, check dragon.vtu:

<VTKFile type="UnstructuredGrid" version="1.0" byte_order="LittleEndian" header_type="UInt64">
  <UnstructuredGrid>
    <Piece NumberOfPoints="69827" NumberOfCells="139650">
      <Cells>
        <DataArray type="Int64" Name="connectivity" format="appended" RangeMin="" RangeMax="" offset="837932"/>
        <DataArray type="Int64" Name="offsets" format="appended" RangeMin="" RangeMax="" offset="4189540"/>
        <DataArray type="UInt8" Name="types" format="appended" RangeMin="" RangeMax="" offset="5306748"/>
      </Cells>
    </Piece>
  </UnstructuredGrid>
  <AppendedData encoding="raw">
   _$É�����ıAdÌAÁÊÃÿ@>yAn£GÁÏAA(~AÁþ`AF¶Áo.@Ô«¬A3Ä|Ásc2@ï8±A cÁÉX@®AZ/AϱÁ:»AA)³Á(ÉAs!AFÁ\A½A*ÁyA*)AéÔÁØÓAÀ¡Aã_ÁóA`öBÌ]gADé¸AdBdÌnA|r·AhB^ºnA­zºAȦ
   [...]

Naively doing

import xml.etree.ElementTree as ET
parser = ET.XMLParser()
tree = ET.parse("dragon.vtu", parser)

does not work:

Traceback (most recent call last):
  File "f.py", line 3, in <module>
    tree = ET.parse("dragon.vtu", parser)
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "/usr/lib/python3.7/xml/etree/ElementTree.py", line 604, in parse
    parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 28, column 5

Any hints?

Upvotes: 2

Views: 4785

Answers (1)

kjhughes

Reputation: 111726

The problem is that your data is not XML because it contains illegal characters; therefore, any conformant XML parser will properly reject it.

Fix the problem upstream: Rather than embedding binary data directly, first encode it as Base64.
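As a rough illustration of that upstream fix, here is a minimal sketch of Base64-encoding a binary payload so that it survives a round trip through a conformant XML parser. The element names `Container` and `Payload` are hypothetical, chosen only for this example; this is not the layout VTK itself uses for Base64 data.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary payload, including bytes that are illegal in XML text.
raw = bytes(range(256))

# Hypothetical element names, used only for illustration.
root = ET.Element("Container")
payload = ET.SubElement(root, "Payload", encoding="base64")
payload.text = base64.b64encode(raw).decode("ascii")

# Serialize, then parse again with a conformant parser.
xml_bytes = ET.tostring(root)
parsed = ET.fromstring(xml_bytes)

# Decoding the element text recovers the original bytes.
decoded = base64.b64decode(parsed.find("Payload").text)
```

The cost is size: Base64 output is about a third larger than the raw bytes.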

I cannot fix the problem upstream...

Then you're in the unfortunate position of having received data that is not XML. See the following for your options: How to parse invalid (bad / not well-formed) XML?

...since the binary content is part of the VTU specification.

Any specification that includes unconstrained binary data in XML is broken as designed. Your options include those of parsing bad XML (see above link), using only the culprit's provided libraries/toolkits, or writing your own library/toolkit – not great options, but such are the consequences of vendors not following the XML specification.
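For the "write your own" route, one possible approach is to cut the raw section out of the file before handing the rest to an XML parser. The sketch below assumes that the binary data lives only inside a single `AppendedData` element, that the payload starts right after the first underscore in that element (as in the dragon.vtu excerpt), and that the last `</AppendedData>` in the file is the real closing tag. The function name `split_raw_appended` is hypothetical.

```python
import xml.etree.ElementTree as ET

CLOSE_TAG = b"</AppendedData>"


def split_raw_appended(data: bytes):
    """Return (parsed XML root without AppendedData, raw appended bytes)."""
    open_start = data.find(b"<AppendedData")
    if open_start == -1:
        # No appended section: the whole file should be ordinary XML.
        return ET.fromstring(data), b""

    # End of the opening tag, e.g. <AppendedData encoding="raw">
    open_end = data.index(b">", open_start)

    # Search from the end, since the binary payload could contain anything.
    close_start = data.rfind(CLOSE_TAG)
    if close_start == -1:
        raise ValueError("missing </AppendedData> closing tag")

    body = data[open_end + 1:close_start]
    # Assumption: the payload begins after the first underscore.
    raw = body[body.index(b"_") + 1:]

    # Everything except the AppendedData element is parsed as XML.
    xml_part = data[:open_start] + data[close_start + len(CLOSE_TAG):]
    return ET.fromstring(xml_part), raw


with_raw = (
    b'<VTKFile type="UnstructuredGrid"><UnstructuredGrid/>'
    b'<AppendedData encoding="raw">\n   _\x00\x01\xff<&\x02\n  '
    b"</AppendedData></VTKFile>"
)
root, raw = split_raw_appended(with_raw)
```

Note that `raw` still carries any whitespace that precedes the closing tag, so individual arrays would have to be located through the `offset` attributes of the `DataArray` elements rather than by the length of `raw`.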

Upvotes: 4
