heltonbiker

Reputation: 27605

How to read XML data from the header of a mixed xml/binary file in C#

I have the task to write a reader for a file format with the following specification:

  1. First section is plain xml with metadata (utf-8);
  2. Last section is a stream of 16bit values (binary);
  3. These two sections are separated by one byte with value 29 (group separator in the ASCII table).

I see two ways to read the xml part of the file. The first one is to build a string byte by byte until I find the separator.

The other is to use some library that would parse the xml and automatically detect the end of well-formed xml.

The question is: is there any .NET library that would stop automatically after the last closing tag in the XML?

(or, can anyone suggest a saner way to read this kind of file format?)


UPDATE: Following the answer from Peter Duniho, with slight modifications, I ended up with this (it works, though not thoroughly unit-tested yet).

        int position = 0;
        MemoryStream ms;

        using (FileStream fs = File.OpenRead("file.xml"))
        using (ms = new MemoryStream())
        {
            int current;
            while ((current = fs.ReadByte()) != -1) // ReadByte returns -1 at end of stream
            {
                position++;

                if (current == 29)
                    break;

                ms.WriteByte((byte)current);
            }
        }

        var xmlheader = new XmlDocument();
        xmlheader.LoadXml(Encoding.UTF8.GetString(ms.ToArray()));

Upvotes: 2

Views: 1411

Answers (2)

Peter Duniho

Reputation: 70691

Given the information you've provided, simply searching for the byte with value 29 should work, because the XML is UTF-8, where every byte of a multi-byte sequence is 128 or higher, so a byte of value 29 can appear only if the character with code point 29 is itself present in the file. Now, I guess it could be present, but it would be surprising since that's in the control character range of the ASCII values.

From the XML 1.0 spec:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

While the comment implies 29 would be a valid codepoint in an XML file (since it is itself a valid Unicode character), I consider the actual grammar normative. I.e. it specifically excludes characters below codepoint 32 except tab, newline, and carriage return, so 29 is not a valid XML character (just as Jon Skeet said).
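
As a quick check of that reading of the grammar, here is a minimal sketch using XmlConvert.IsXmlChar, which tests a character against the XML 1.0 Char production, plus an attempt to parse a document containing the character. The element name "a" and the variable names are made up for illustration.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Code point 29 (0x1D, group separator) against the XML 1.0 "Char" production.
bool separatorIsXmlChar = XmlConvert.IsXmlChar((char)29);

// Tab is one of the three control characters the production allows.
bool tabIsXmlChar = XmlConvert.IsXmlChar('\t');

// With default settings the parser rejects the character inside a document too.
bool parserRejected = false;
try
{
    XDocument.Parse("<a>\u001D</a>");
}
catch (XmlException)
{
    parserRejected = true;
}

Console.WriteLine($"0x1D valid: {separatorIsXmlChar}, tab valid: {tabIsXmlChar}, rejected: {parserRejected}");
```

So a well-formed header cannot contain the separator byte, which is what makes the plain byte search safe.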

That said, without a complete specification of the input, I can't rule out the possibility. So if you really want to be on the safe side, you'd have to go ahead and parse the XML, hoping to find a proper closing tag for the root element. Then you can search for the byte 29 (since there might be whitespace after the closing tag), to identify where the binary data starts.

(Note: asking for a library is "off-topic". But you might be able to use XmlReader to do this, since it operates on an iterative basis; i.e. you can terminate its operation after you hit the final closing tag, and before it starts complaining about finding invalid XML. This would depend, however, on buffering that XmlReader might do; if it buffers additional data past the closing tag, then the position of the underlying stream would be past the 29 byte, making it harder to find. Frankly, just searching for the 29 byte seems like the way to go).

You could search the header for the 29 byte like this (warning: browser code...uncompiled, untested):

MemoryStream xmlStream = new MemoryStream();

using (FileStream stream = File.OpenRead(path))
{
    int offset = 0, bytesRead = 0;

    // arbitrary size...whatever you think is reasonable would be fine
    byte[] buffer = new byte[1024];

    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        bool found = false;

        for (int i = 0; i < bytesRead; i++)
        {
            if (buffer[i] == 29)
            {
                offset += i;
                found = true;
                xmlStream.Write(buffer, 0, i); // copy the i bytes that precede the separator
                break;
            }
        }

        if (found)
        {
            break;
        }

        offset += bytesRead;
        xmlStream.Write(buffer, 0, bytesRead);
    }

    if (bytesRead > 0)
    {
        // found byte 29 at offset "offset"

        xmlStream.Position = 0;

        // pass "xmlStream" object to your preferred XML-parsing API to
        // parse the XML, or just return it or "xmlStream.ToArray()" as
        // appropriate to the caller to let the caller deal with it.
    }
    else
    {
        // byte 29 not found!
    }
}

EDIT:

I've updated the above code example to write to a MemoryStream object, so that once you've found the byte 29 value, you've got a stream all ready to go for XML parsing. Of course, I'm sure you could have added that yourself if you really needed to. In any case, obviously you would modify the code, with or without that feature, to suit your needs.

(There is the obvious hazard in writing to the MemoryStream as you search: if you don't ever find the byte 29 value, you'll wind up with a complete copy of the entire file in memory, which you'd suggested you might prefer to avoid. But given that that's the error scenario, that might be okay).
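
For the "pass it to your preferred XML-parsing API" step, here is a minimal sketch that assumes XDocument is the API of choice. The MemoryStream below is only a stand-in for the one filled by the search loop, and the XML content (the "meta" and "rate" elements) is invented for illustration.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Stand-in for the stream produced by the search loop; the content is hypothetical.
var xmlStream = new MemoryStream(Encoding.UTF8.GetBytes(
    "<?xml version=\"1.0\" encoding=\"utf-8\"?><meta><rate>500</rate></meta>"));

// Rewind before parsing, as in the loop above.
xmlStream.Position = 0;

// Loading from the stream (rather than from a decoded string) leaves
// byte-order-mark and encoding-declaration handling to the parser.
XDocument header = XDocument.Load(xmlStream);

string rate = (string)header.Root.Element("rate");
Console.WriteLine(rate); // prints "500"
```

Handing the parser bytes instead of a string also means the code does not need to change if a later version of the header declares a different encoding.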

Upvotes: 2

Jon Skeet

Reputation: 1502106

While the "read to the closing tag" sounds appealing, you'd need to have a parser which didn't end up buffering all the data.

I would read all the data into a byte[], then search for the separator there - then you can split the binary data into two, and parse each part appropriately. I would do that entirely working in binary, with no strings involved - you can create a MemoryStream for each section using new MemoryStream(byte[], int, int) and then pass that to an XML parser and whatever your final section parser is. That way you don't need to worry about handling UTF-8, or detecting if a later version of the XML doesn't use UTF-8, etc.

So something like:

byte[] allData = File.ReadAllBytes(filename);
int separatorIndex = Array.IndexOf(allData, (byte) 29);
if (separatorIndex == -1)
{
    // throw an exception or whatever
}
var xmlStream = new MemoryStream(allData, 0, separatorIndex);
var lastPartStream = new MemoryStream(
      allData, separatorIndex + 1, allData.Length - separatorIndex - 1);
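
To show the two streams being consumed, here is a sketch that runs the same split on a small in-memory sample instead of a real file. It assumes the 16-bit values are signed and little-endian (which is what BinaryReader reads); the question does not state either, so adjust as needed. The sample XML and the names fileImage and values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Build a sample file image: XML header, separator byte 29, three 16-bit values.
var fileImage = new MemoryStream();
byte[] xmlBytes = Encoding.UTF8.GetBytes("<meta><channels>1</channels></meta>");
fileImage.Write(xmlBytes, 0, xmlBytes.Length);
fileImage.WriteByte(29);
var writer = new BinaryWriter(fileImage);
writer.Write((short)1);
writer.Write((short)-2);
writer.Write((short)300);
writer.Flush();
byte[] allData = fileImage.ToArray();

// Same split as above: the first byte 29 marks the end of the XML.
int separatorIndex = Array.IndexOf(allData, (byte)29);
var xmlStream = new MemoryStream(allData, 0, separatorIndex);
var lastPartStream = new MemoryStream(
    allData, separatorIndex + 1, allData.Length - separatorIndex - 1);

// The header goes to the XML parser as raw bytes.
XDocument header = XDocument.Load(xmlStream);

// Assumption: signed, little-endian 16-bit values.
var values = new List<short>();
using (var reader = new BinaryReader(lastPartStream))
{
    while (lastPartStream.Length - lastPartStream.Position >= 2)
    {
        values.Add(reader.ReadInt16());
    }
}

Console.WriteLine(string.Join(", ", values)); // prints "1, -2, 300"
```

Only the first occurrence of byte 29 matters: the binary section may well contain that value, but a well-formed XML header cannot.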

Upvotes: 2
