
Reputation: 1298

Parsing concatenated, non-delimited XML messages from TCP-stream using C#

I am trying to parse XML messages which are sent to my C# application over TCP. Unfortunately, the protocol cannot be changed: the XML messages are not delimited and no length prefix is used. Moreover, the character encoding is not fixed, but each message starts with an XML declaration (<?xml ...?>). The question is: how can I read one XML message at a time using C#?

Up to now, I have tried reading the data from the TCP stream into a byte array and accessing it through a MemoryStream. The problem is that the buffer might contain more than one XML message, or the first message may be incomplete. In these cases, I get an exception when trying to parse it with XmlReader.Read or XmlDocument.Load, but unfortunately the XmlException does not really allow me to distinguish between the two cases (except by parsing the localized error string).

I tried using XmlReader.Read and counting the number of Element and EndElement nodes. That way I know when I have finished reading the first complete XML message.
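A minimal sketch of that counting approach, run against an in-memory string instead of a socket. The names DepthCounter and ReadOneRoot and the sample input are illustrative assumptions, not code from the application:

```csharp
using System;
using System.IO;
using System.Xml;

static class DepthCounter
{
    // Advances the reader until the first root element is complete.
    // Truncated or malformed input surfaces as an XmlException from Read().
    // The method only falls through to "return false" if the reader ends
    // without ever closing a root element.
    public static bool ReadOneRoot(XmlReader reader)
    {
        int depth = 0;
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (!reader.IsEmptyElement) depth++;
                else if (depth == 0) return true;   // self-closing root such as <a/>
            }
            else if (reader.NodeType == XmlNodeType.EndElement)
            {
                if (--depth == 0) return true;      // closing tag of the root
            }
        }
        return false;
    }

    static void Main()
    {
        // Two concatenated messages; only the first one is consumed here.
        string data = "<?xml version=\"1.0\"?><a><b/><c>x</c></a>"
                    + "<?xml version=\"1.0\"?><d/>";
        using (var reader = XmlReader.Create(new StringReader(data)))
        {
            bool complete = ReadOneRoot(reader);
            Console.WriteLine(complete + " " + reader.NodeType + " " + reader.Name);
        }
    }
}
```

The reader stops on the root EndElement, so it has not yet looked at the second declaration; calling Read() again at that point is what raises the exception.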

However, there are several problems. If the buffer does not yet contain the entire message, how can I distinguish the XmlException from an actually invalid, non-well-formed message? In other words, if an exception is thrown before reading the first root EndElement, how can I decide whether to abort the connection with error, or to collect more bytes from the TCP stream?

If no exception occurs, the XmlReader is positioned at the start of the root EndElement. Casting the XmlReader to IXmlLineInfo gives me the current LineNumber and LinePosition; however, it is not straightforward to get the byte position where the EndElement really ends. In order to do that, I would have to convert the byte array into a string (with the encoding specified in the XML declaration), seek to LineNumber,LinePosition and convert that back to a byte offset. I tried to do that with StreamReader.ReadLine, but the stream reader gives no public access to the current byte position.

All this seems very inelegant and not robust. I wonder if you have ideas for a better solution. Thank you.

Upvotes: 11

Views: 4097

Answers (4)


Reputation: 2931

The 2 issues that I found were:

  1. XmlReader will only permit an XML declaration at the very beginning. Since it can't be reset it needs to be recreated.
  2. Once XmlReader has done its work it will usually have consumed additional characters after the end of the document because it uses the Read(char[], int, int) method.

My (brittle) workaround is to create a wrapper that only fills the array until a '>' is encountered. This keeps the XmlReader from consuming characters past the ending > of the document it was parsing:

public class SegmentingReader : TextReader {
    private TextReader reader;
    private char trigger;

    public SegmentingReader(TextReader reader, char trigger) {
        this.reader = reader;
        this.trigger = trigger;
    }

    // Dispose omitted for brevity

    public override int Peek() { return reader.Peek(); }

    public override int Read() { return reader.Read(); }

    // Fills the buffer only up to and including the next trigger character,
    // so the XmlReader cannot buffer past the end of the current document.
    public override int Read(char[] buffer, int index, int count) {
        int n = 0;
        while (n < count) {
            int c = reader.Read();
            if (c < 0) break;  // end of input
            char ch = (char)c;
            buffer[index + n] = ch;
            n++;
            if (ch == trigger) break;
        }
        return n;
    }
}

Then it can be used as simply as:

var serializer = new XmlSerializer(typeof(SerializedClass));
using (var inputReader = new SegmentingReader(/* TextReader from somewhere */, '>')) {
    while (inputReader.Peek() != -1) {
        using (var xmlReader = XmlReader.Create(inputReader)) {
            // ReadSubtree requires the reader to be positioned on an element.
            xmlReader.MoveToContent();
            var obj = serializer.Deserialize(xmlReader.ReadSubtree());
        }
    }
}

Upvotes: 0


Reputation: 1298

After looking around for some time I think I can answer my own question as follows (I might be wrong, corrections are welcome):

  • I found no way to make the XmlReader continue parsing a second XML message (at least not if the second message has an XmlDeclaration). XmlTextReader.ResetState could do something similar, but for that I would have to assume the same encoding for all messages. Therefore I could not connect the XmlReader directly to the TcpStream.

  • After closing the XmlReader, the buffer is not positioned at the reader's last position. So it is not possible to close the reader and use a new one to continue with the next message. I guess the reason for this is that the reader could not successfully seek on every possible input stream.

  • When XmlReader throws an exception, it cannot be determined whether it happened because of a premature EOF or because of non-well-formed XML. XmlReader.EOF is not set in case of an exception. As a workaround I derived my own memory stream (EofMemoryStream below), which returns the very last byte as a single byte. This way I know that the XmlReader was really interested in the last byte and the following exception is likely due to a truncated message. This is kind of sloppy, in that it might not immediately detect every non-well-formed message; however, after appending more bytes to the buffer, sooner or later the error will be detected.

  • I could cast my XmlReader to the IXmlLineInfo interface, which gives access to the LineNumber and the LinePosition of the current node. So after reading the first message I remember these positions and use them to truncate the buffer. Here comes the really sloppy part, because I have to use the character encoding to get the byte position. I am sure you could find test cases for the code below where it breaks (e.g. internal elements with mixed encoding). But up to now it has worked for all my tests.

Here is the parser class I came up with; it may be useful (I know, it's very far from perfect...)

class XmlParser {

    private byte[] buffer = new byte[0];

    public int Length {
        get {
            return buffer.Length;
        }
    }

    // Append new binary data to the internal data buffer...
    public XmlParser Append(byte[] buffer2) {
        if (buffer2 != null && buffer2.Length > 0) {
            // I know, it's not an efficient way to do this.
            // The EofMemoryStream should handle a List<byte[]> ...
            byte[] new_buffer = new byte[buffer.Length + buffer2.Length];
            buffer.CopyTo(new_buffer, 0);
            buffer2.CopyTo(new_buffer, buffer.Length);
            buffer = new_buffer;
        }
        return this;
    }

    // MemoryStream which returns the last byte of the buffer individually,
    // so that we know that the buffering XmlReader really looked at the last
    // byte of the stream.
    // Moreover there is an EOF marker.
    private class EofMemoryStream: Stream {
        public bool EOF { get; private set; }
        private MemoryStream mem_;

        public override bool CanSeek {
            get {
                return false;
            }
        }
        public override bool CanWrite {
            get {
                return false;
            }
        }
        public override bool CanRead {
            get {
                return true;
            }
        }
        public override long Length {
            get {
                return mem_.Length;
            }
        }
        public override long Position {
            get {
                return mem_.Position;
            }
            set {
                throw new NotSupportedException();
            }
        }
        public override void Flush() {
        }
        public override long Seek(long offset, SeekOrigin origin) {
            throw new NotSupportedException();
        }
        public override void SetLength(long value) {
            throw new NotSupportedException();
        }
        public override void Write(byte[] buffer, int offset, int count) {
            throw new NotSupportedException();
        }
        public override int Read(byte[] buffer, int offset, int count) {
            // Hand out everything except the last byte, then the last byte alone.
            count = Math.Min(count, Math.Max(1, (int)(Length - Position - 1)));
            int nread = mem_.Read(buffer, offset, count);
            if (nread == 0) {
                // The reader asked for data beyond the end of the buffer.
                EOF = true;
            }
            return nread;
        }
        public EofMemoryStream(byte[] buffer) {
            mem_ = new MemoryStream(buffer, false);
            EOF = false;
        }
        protected override void Dispose(bool disposing) {
            if (disposing) {
                mem_.Dispose();
            }
            base.Dispose(disposing);
        }
    }


    // Parses the first xml message from the stream.
    // If the first message is not yet complete, it returns null.
    // If the buffer contains non-wellformed xml, it ~should~ throw an exception.
    // After reading an xml message, it pops the data from the byte array.
    public Message deserialize() {
        if (buffer.Length == 0) {
            return null;
        }
        Message message = null;

        Encoding encoding = Message.default_encoding;
        //string xml = encoding.GetString(buffer);

        using (EofMemoryStream sbuffer = new EofMemoryStream (buffer)) {

            XmlDocument xmlDocument = null;
            XmlReaderSettings settings = new XmlReaderSettings();

            int LineNumber = -1;
            int LinePosition = -1;
            bool truncate_buffer = false;

            using (XmlReader xmlReader = XmlReader.Create(sbuffer, settings)) {
                try {
                    // Read to the first node (skipping over some element-types).
                    // Don't use MoveToContent here, because it would skip the
                    // XmlDeclaration too...
                    while (xmlReader.Read() &&
                           (xmlReader.NodeType==XmlNodeType.Whitespace ||
                            xmlReader.NodeType==XmlNodeType.Comment)) {
                    }

                    // Check for XML declaration.
                    // If the message has an XmlDeclaration, extract the encoding.
                    switch (xmlReader.NodeType) {
                        case XmlNodeType.XmlDeclaration:
                            while (xmlReader.MoveToNextAttribute()) {
                                if (xmlReader.Name == "encoding") {
                                    encoding = Encoding.GetEncoding(xmlReader.Value);
                                }
                            }
                            break;
                    }

                    // Move to the first element.
                    xmlReader.MoveToContent();

                    if (xmlReader.EOF) {
                        return null;
                    }

                    // Read the entire document.
                    xmlDocument = new XmlDocument();
                    xmlDocument.Load(xmlReader.ReadSubtree());
                } catch (XmlException) {
                    // The parsing of the xml failed. If the XmlReader did
                    // not yet look at the last byte, it is assumed that the
                    // XML is invalid and the exception is re-thrown.
                    if (sbuffer.EOF) {
                        return null;
                    }
                    throw;
                }

                // Try to deserialize an internal data structure using XmlSerializer.
                Type type = null;
                try {
                    type = Type.GetType("my.namespace." + xmlDocument.DocumentElement.Name);
                } catch (Exception) {
                    // No specialized data container for this class found...
                }
                if (type == null) {
                    message = new Message();
                } else {
                    // TODO: reuse the serializer...
                    System.Xml.Serialization.XmlSerializer ser = new System.Xml.Serialization.XmlSerializer(type);
                    message = (Message)ser.Deserialize(new XmlNodeReader(xmlDocument));
                }
                message.doc = xmlDocument;

                // At this point, the first XML message was successfully parsed.

                // Remember the line position of the current end element.
                IXmlLineInfo xmlLineInfo = xmlReader as IXmlLineInfo;
                if (xmlLineInfo != null && xmlLineInfo.HasLineInfo()) {
                    LineNumber = xmlLineInfo.LineNumber;
                    LinePosition = xmlLineInfo.LinePosition;
                }

                // Try to read the rest of the buffer.
                // If an exception is thrown, another xml message follows.
                // This way the xml parser could tell us that the message is finished here.
                // This would be preferred, as truncating the buffer using the line info is sloppy.
                try {
                    while (xmlReader.Read()) {
                    }
                } catch {
                    // A second message follows. Needs the workaround for truncating.
                    truncate_buffer = true;
                }
            }
            if (truncate_buffer) {
                if (LineNumber < 0) {
                    throw new Exception("LineNumber not given. Cannot truncate xml buffer");
                }
                // Convert the buffer to a string using the encoding found before
                // (or the default encoding).
                string s = encoding.GetString(buffer);

                // Seek to the line.
                int char_index = 0;
                while (--LineNumber > 0) {
                    // Recognize \r , \n , \r\n as newlines...
                    char_index = s.IndexOfAny(new char[] {'\r', '\n'}, char_index);
                    // char_index should not be -1 because LineNumber>0, otherwise a range
                    // exception is thrown, which is appropriate.
                    char_index++;
                    if (s[char_index-1]=='\r' && s.Length>char_index && s[char_index]=='\n') {
                        char_index++;
                    }
                }
                char_index += LinePosition - 1;

                var rgx = new System.Text.RegularExpressions.Regex(xmlDocument.DocumentElement.Name + "[ \r\n\t]*\\>");
                System.Text.RegularExpressions.Match match = rgx.Match(s, char_index);
                if (!match.Success || match.Index != char_index) {
                    throw new Exception("could not find EndElement to truncate the xml buffer.");
                }
                char_index += match.Value.Length;

                // Convert the character offset back to the byte offset (for the given encoding).
                int line1_boffset = encoding.GetByteCount(s.Substring(0, char_index));

                // Remove the bytes from the buffer.
                buffer = buffer.Skip(line1_boffset).ToArray();
            } else {
                buffer = new byte[0];
            }
        }
        return message;
    }
}
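
The character-to-byte conversion used for truncating the buffer can be tried in isolation. This is only a small sketch, assuming UTF-8 and a made-up two-message string; ByteOffsetDemo and CharIndexToByteOffset are names invented here, not part of the parser above:

```csharp
using System;
using System.Text;

static class ByteOffsetDemo
{
    // Number of bytes that the first charIndex characters occupy in the given
    // encoding, i.e. the byte offset at which the next message would start.
    public static int CharIndexToByteOffset(string text, int charIndex, Encoding encoding)
    {
        return encoding.GetByteCount(text.Substring(0, charIndex));
    }

    static void Main()
    {
        // "\u00e9" (e with acute accent) is one character but two bytes in UTF-8.
        string s = "<a>\u00e9</a><b/>";
        int charIndex = s.IndexOf("<b/>", StringComparison.Ordinal);
        int byteOffset = CharIndexToByteOffset(s, charIndex, Encoding.UTF8);
        Console.WriteLine(charIndex + " characters, " + byteOffset + " bytes");
    }
}
```

The two offsets differ as soon as the message contains a character that needs more than one byte, which is why the line information from the reader cannot be used directly as a byte position.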

Upvotes: 3

Paul Turner

Reputation: 39615

Reading into a MemoryStream is not necessary to use an XmlReader. You can attach the reader more directly to the stream to read as much as you require to reach the end of the XML document. A BufferedStream can be utilized to improve the efficiency of reading from the socket directly.

string server = "myserver";   // host name or IP address
string message = "GetMyXml";
int port = 13000;
int bufferSize = 1024;

using(var client = new TcpClient(server, port))
using(var clientStream = client.GetStream())
using(var bufferedStream = new BufferedStream(clientStream, bufferSize))
using(var xmlReader = XmlReader.Create(bufferedStream))
{
    try
    {
        while(xmlReader.Read())
        {
            // Check for XML declaration.
            if(xmlReader.NodeType != XmlNodeType.XmlDeclaration)
                throw new Exception("Expected XML declaration.");

            // Move to the first element.
            xmlReader.MoveToContent();

            // Read the root element.
            // Hand this document to another method to process further.
            var xmlDocument = new XmlDocument();
            xmlDocument.Load(xmlReader.ReadSubtree());
        }
    }
    catch(XmlException ex)
    {
        // Record exception reading stream.
        // Move reader to start of next document or rethrow exception to exit.
    }
}

The key to making this work is the call to XmlReader.ReadSubtree() which creates a child reader on top of the parent reader, one that will treat the current element (in this case the root element) as the entire XML tree. This should allow you to parse document elements separately.
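
To see what ReadSubtree() does without a network connection, here is a small sketch on an in-memory string. The class and method names (ReadSubtreeDemo, LoadFirst) and the sample XML are assumptions made for illustration:

```csharp
using System;
using System.IO;
using System.Xml;

static class ReadSubtreeDemo
{
    // Loads only the first <name> element below the root into its own
    // XmlDocument and returns its markup.
    public static string LoadFirst(string xml, string name)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();           // skip the declaration, land on the root
            reader.ReadToDescendant(name);    // position on the element of interest
            var doc = new XmlDocument();
            using (var subtree = reader.ReadSubtree())
            {
                // The child reader treats the current element as a whole document.
                doc.Load(subtree);
            }
            // Once the child reader is closed, the parent reader sits on the
            // closing tag of that element and can continue from there.
            return doc.OuterXml;
        }
    }

    static void Main()
    {
        string xml = "<?xml version=\"1.0\"?><root><a>1</a><b>2</b></root>";
        Console.WriteLine(LoadFirst(xml, "a"));
    }
}
```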

My code's a little sloppy around reading the document, especially as I ignore all the information in the XML declaration. I'm sure there's room for improvement, but hopefully this gets you on the right track.

Upvotes: 2

Hans Olsson

Reputation: 55001

Assuming that you can change the protocol, I'd suggest adding start and stop markers to the messages, so that when you read it all in as a text stream you can split it up into separate messages (leaving incomplete messages in an "incoming buffer" of some kind), strip the markers, and then you know that you've got exactly one message at a time.
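
A rough sketch of that idea, assuming the protocol could be changed. The marker characters (STX and ETX) and the MarkerFramer class are hypothetical choices made for this example; the answer does not name any specific markers:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class MarkerFramer
{
    // Hypothetical start and stop markers (ASCII STX and ETX).
    private const char Start = '\u0002';
    private const char Stop = '\u0003';

    // Text received so far that does not yet form a complete message.
    private readonly StringBuilder incoming = new StringBuilder();

    // Appends newly received text and returns every complete message,
    // with the markers removed. Incomplete data stays in the buffer.
    public List<string> Feed(string chunk)
    {
        incoming.Append(chunk);
        string text = incoming.ToString();
        var messages = new List<string>();
        int pos = 0;
        while (true)
        {
            int start = text.IndexOf(Start, pos);
            if (start < 0) { pos = text.Length; break; }   // no message begins here
            int stop = text.IndexOf(Stop, start + 1);
            if (stop < 0) { pos = start; break; }          // message still incomplete
            messages.Add(text.Substring(start + 1, stop - start - 1));
            pos = stop + 1;
        }
        incoming.Clear();
        incoming.Append(text, pos, text.Length - pos);
        return messages;
    }

    static void Main()
    {
        var framer = new MarkerFramer();
        // The second message arrives split across two reads.
        foreach (string m in framer.Feed("\u0002<a/>\u0003\u0002<b")) Console.WriteLine(m);
        foreach (string m in framer.Feed("/>\u0003")) Console.WriteLine(m);
    }
}
```

Each returned string can then be handed to XmlDocument.LoadXml or an XmlReader on its own, since it contains exactly one message.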

Upvotes: 0
