Encoding and null terminated strings

Question

EDIT: I've come up with a solution, here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.

    /// <summary>
    /// Decodes a string from the specified bytes in the specified encoding.
    /// </summary>
    /// <param name="Length">Specify -1 to read until null; otherwise, specify the number of bytes that make up the string.</param>
    public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
    {
        if (Length == 0) return string.Empty;
        var sb = new StringBuilder();
        if (Length <= -1)
        {
            using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
            {
                int ch;
                while (true)
                {
                    ch = sr.Read();
                    if (ch <= 0) break;
                    sb.Append((char)ch);
                }
                if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
                return sb.ToString();
            }
        }
        else return Encoding.GetString(Source, Offset, Length);
    }

I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.

Basically, I wanted a simple method, ReadNullTerminatedString. It wasn't too hard to write at first: I used Encoding.IsSingleByte to determine the length of a single character, read that many bytes, checked for 0s, and either stopped or continued based on the result.

This is where it gets tricky. UTF8 is a variable-length encoding. Encoding.IsSingleByte returns false for it, but that only tells me the encoding is not fixed at one byte per character; a given character can still be 1 byte (or as many as 4), so my implementation based on Encoding.IsSingleByte wouldn't work for UTF8.
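To make the variable-width point concrete, here is a small sketch (not part of the original post; the class and method names are made up for illustration) that compares IsSingleByte with the actual UTF8 byte counts of a few characters:

```csharp
using System;
using System.Text;

public static class Utf8WidthDemo
{
    // Bytes needed to encode `text` in UTF-8.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    // Prints why IsSingleByte cannot serve as a per-character width.
    public static void Print()
    {
        Console.WriteLine(Encoding.ASCII.IsSingleByte); // True: always one byte per char
        Console.WriteLine(Encoding.UTF8.IsSingleByte);  // False: width varies per character
        Console.WriteLine(Utf8Length("a"));             // 1
        Console.WriteLine(Utf8Length("\u00E9"));        // 2 (e with acute accent)
        Console.WriteLine(Utf8Length("\u20AC"));        // 3 (euro sign)
        Console.WriteLine(Utf8Length("\U0001F600"));    // 4 (one code point, two UTF-16 chars)
    }
}
```

So a reader that steps through the bytes by a fixed width has no way to know where one UTF8 character ends and the next begins.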

At that point I wasn't sure if that method could be corrected, so I had another idea. Just use the encoding's GetString method on the bytes, use the maximum length the string can be for the count param, and then trim the zeros off the returned string.

That too has a caveat. My managed applications will be interacting with byte arrays returned from unmanaged code, where there will of course be a null terminator, but possibly also extra junk characters after it. For example: "blah\0\0\oldstring"

ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF8. The second solution also will not work - it will trim the 0s, but the junk will remain.
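A minimal sketch of that failure, assuming the junk bytes happen to decode as text (the names DecodeAndTrim and DecodeAndCut are hypothetical, and the sample buffer is a slightly simplified version of the example above). DecodeAndCut is one possible variation, not something from the original post: it still decodes the whole buffer, but cuts at the first decoded null instead of trimming.

```csharp
using System;
using System.Text;

public static class NullTerminatedDemo
{
    // The "decode everything, then trim" idea: only zeros at the very end
    // of the decoded string are removed, so trailing junk survives.
    public static string DecodeAndTrim(byte[] buffer, Encoding encoding)
    {
        return encoding.GetString(buffer).TrimEnd('\0');
    }

    // Variation: decode everything, then cut at the first decoded U+0000.
    public static string DecodeAndCut(byte[] buffer, Encoding encoding)
    {
        string decoded = encoding.GetString(buffer);
        int end = decoded.IndexOf('\0');
        return end >= 0 ? decoded.Substring(0, end) : decoded;
    }
}
```

For the UTF8 bytes of "blah\0\0oldstring", DecodeAndTrim returns the string unchanged (the zeros are not at the end, so nothing is trimmed), while DecodeAndCut returns "blah". The cost is that the entire buffer gets decoded, junk included.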

Any ideas for an elegant solution for C#?

Ian Boyd · Accepted Answer

Your best solution is to use an implementation of TextReader:

StreamReader if you're reading from a stream
StringReader if you're reading from a string

With this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int:

int ch = reader.Read();

Internally, the magic is done by the .NET Decoder class (which you get from your Encoding):

var decoder = Encoding.UTF7.GetDecoder();

The Decoder class needs a short array to use as a buffer. Fortunately, StreamReader knows how to keep that buffer filled and make everything work.
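If you would rather skip the stream and drive the Decoder yourself, something along these lines might work. This is a rough sketch, not what StreamReader actually does internally: the class name, the method name, the one-byte-at-a-time loop, and the 16-char scratch buffer are all assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class DecoderDemo
{
    // Feeds the decoder one byte at a time and stops at the first decoded U+0000.
    public static string ReadNullTerminated(byte[] source, int offset, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        var sb = new StringBuilder();
        // One input byte normally yields 0, 1, or 2 chars (a surrogate pair);
        // 16 is just generous headroom for fallback characters.
        var chars = new char[16];

        for (int i = offset; i < source.Length; i++)
        {
            // flush: false keeps partial multi-byte sequences inside the decoder
            // until the remaining bytes arrive.
            int produced = decoder.GetChars(source, i, 1, chars, 0, false);
            for (int j = 0; j < produced; j++)
            {
                if (chars[j] == '\0') return sb.ToString();
                sb.Append(chars[j]);
            }
        }
        throw new InvalidDataException("End of buffer reached; null terminator not found.");
    }
}
```

Because the decoder carries state between calls, the same loop handles UTF8, UTF16, and single-byte encodings, and it never looks at the bytes after the terminator.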

Pseudocode

Untried and untested:

string ReadNullTerminatedString(Stream stm, Encoding encoding)
{
   var sb = new StringBuilder();

   // Note: StreamReader buffers, so it may read past the null terminator in stm.
   TextReader rdr = new StreamReader(stm, encoding);
   int ch = rdr.Read();
   while (ch > 0) // Read() returns -1 when we've hit the end, and 0 is the null terminator
   {
      sb.Append((char)ch);
      ch = rdr.Read();
   }
   return sb.ToString();
}

Note: Any code released into public domain. No attribution required.
