Reputation: 1320
EDIT: I've come up with a solution, here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.
/// <summary>
/// Decodes a string from the specified bytes in the specified encoding.
/// </summary>
/// <param name="Length">Specify -1 to read until null, otherwise, specify the amount of bytes that make up the string.</param>
public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
{
if (Length == 0) return string.Empty;
var sb = new StringBuilder();
if (Length <= -1)
{
using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
{
int ch;
while (true)
{
ch = sr.Read();
if (ch <= 0) break;
sb.Append((char)ch);
}
if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
return sb.ToString();
}
}
else return Encoding.GetString(Source, Offset, Length);
}
I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.
Basically, I wanted to make an easy method, ReadNullTerminatedString. It wasn't too hard to make at first. I used Encoding.IsSingleByte to determine a single character's length, would read the byte(s), check for 0s, and stop reading/continue based on the result.
This is where it gets tricky. UTF8 has variable length encoding. Encoding.IsSingleByte returns false, but that is not always correct since it's a variable encoding and a character can be 1 byte, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF8.
At that point I wasn't sure if that method could be corrected, so I had another idea. Just use the encoding's GetString method on the bytes, use the maximum length the string can be for the count param, and then trim the zeros off the returned string.
That too has a caveat. I have to consider cases where my managed applications will be interacting with byte arrays returned from unmanaged code, cases where there will be a null terminator, of course, but the possibility of having extra junk characters after it. For example: "blah\0\0\oldstring"
ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF8. The second solution also will not work - it will trim the 0s, but the junk will remain.
Any ideas for an elegant solution for C#?
Upvotes: 1
Views: 1299
Reputation: 256891
Your best solution is to use an implementation of TextReader
:
StreamReader
if you're reading from a streamStringReader
if you're reading from a stringWith this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int
:
int ch = reader.Read();
Internally the magic is done through the C# Decoder
class (which comes from your Encoding):
var decoder = Encoding.UTF7.GetDecoder();
The Decoder
class needs a short array buffer. Fortunately StreamReader
knows how to keep the buffer filled and everything work.
Untried, untested, and only happens to look like C#:
String ReadNullTerminatedString(Stream stm, Encoding encoding)
{
StringBuilder sb = new StringBuilder();
TextReader rdr = new StreamReader(stm, encoding);
int ch = rdr.Read();
while (ch > 0) //returns -1 when we've hit the end, and 0 is null
{
sb.AppendChar(Char(ch));
int ch = rdr.Read();
}
return sb.ToString();
}
Note: Any code released into public domain. No attribution required.
Upvotes: 1