Jeremy

Reputation: 46350

Reading string as a stream without copying

I have some data in a string. I have a function that takes a stream as input. I want to provide my data to my function without having to copy the complete string into a stream. Essentially I'm looking for a stream class that can wrap a string and read from it.

The only suggestions I've seen online are StringReader, which is NOT a stream, or creating a MemoryStream and writing to it, which means copying the data. I could write my own stream class, but the tricky part is handling encoding, because a stream deals in bytes. Is there a way to do this without writing new stream classes?

I'm implementing pipeline components in BizTalk. BizTalk handles everything as streams, so you always pass data to BizTalk in a stream. BizTalk always reads from that stream in small chunks, so it doesn't make sense to copy the entire string into a stream (especially if the string is large) when I could supply the data in the chunks BizTalk asks for.

Upvotes: 12

Views: 6988

Answers (4)

xmedeko

Reputation: 7802

Here is a proper StringReaderStream with the following drawbacks:

  • The buffer passed to Read has to be at least maxBytesPerChar bytes long. It's possible to support smaller buffers by keeping an internal one-char buffer, buff = new byte[maxBytesPerChar], but that's not necessary for most usages.
  • No Seek. Seeking is possible, but would be very tricky in general. (Some cases, like seeking to the beginning or to the end, are simple to implement.)

using System;
using System.IO;
using System.Text;

/// <summary>
/// Convert string to byte stream.
/// <para>
/// Slower than <see cref="Encoding.GetBytes()"/>, but saves memory for a large string.
/// </para>
/// </summary>
public class StringReaderStream : Stream
{
    private string input;
    private readonly Encoding encoding;
    private int maxBytesPerChar;
    private int inputLength;
    private int inputPosition;
    private readonly long length;
    private long position;

    public StringReaderStream(string input)
        : this(input, Encoding.UTF8)
    { }

    public StringReaderStream(string input, Encoding encoding)
    {
        this.encoding = encoding ?? throw new ArgumentNullException(nameof(encoding));
        this.input = input;
        inputLength = input == null ? 0 : input.Length;
        if (!string.IsNullOrEmpty(input))
            length = encoding.GetByteCount(input);
        maxBytesPerChar = encoding == Encoding.ASCII ? 1 : encoding.GetMaxByteCount(1);
    }

    public override bool CanRead => true;

    public override bool CanSeek => false;

    public override bool CanWrite => false;

    public override long Length => length;

    public override long Position
    {
        get => position;
        set => throw new NotSupportedException();
    }

    public override void Flush()
    {
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        if (inputPosition >= inputLength)
            return 0;
        if (count < maxBytesPerChar)
            throw new ArgumentException("count has to be greater or equal to max encoding byte count per char");
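        int charCount = Math.Min(inputLength - inputPosition, count / maxBytesPerChar);
        // Avoid ending a chunk on a high surrogate, so that a surrogate pair is not split
        // across two GetBytes calls. (A one-char chunk can still split a pair.)
        if (charCount > 1 && char.IsHighSurrogate(input[inputPosition + charCount - 1]))
            charCount--;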
        int byteCount = encoding.GetBytes(input, inputPosition, charCount, buffer, offset);
        inputPosition += charCount;
        position += byteCount;
        return byteCount;
    }

    // Unsupported Stream operations conventionally throw NotSupportedException.
    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotSupportedException();
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }
}

Upvotes: 9

dbc

Reputation: 116721

While this question predates .NET 5, this can be done fairly easily in .NET 5 and later, thanks to the introduction of Encoding.CreateTranscodingStream:

Creates a Stream that serves to transcode data between an inner Encoding and an outer Encoding, similar to Convert(Encoding, Encoding, Byte[]).

The trick is to define an underlying UnicodeStream that directly accesses the bytes of the string, then wrap that in the transcoding stream to present streamed content with the required encoding.
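
To illustrate just the transcoding step, here is a minimal sketch (assuming .NET 5 or later and top-level statements). It uses a MemoryStream of UTF-16 bytes as a stand-in for the inner stream, so unlike the UnicodeStream described here it does copy the string; the variable names are purely illustrative.

```csharp
using System;
using System.IO;
using System.Text;

string sample = "h\u00E9llo";

// Stand-in inner stream holding UTF-16 (little-endian) bytes.
// Illustration only: this copies the string, which UnicodeStream avoids.
var utf16Source = new MemoryStream(Encoding.Unicode.GetBytes(sample));

// Readers of utf8View see UTF-8 bytes, transcoded on the fly from the inner stream.
// leaveOpen: false means disposing utf8View also disposes utf16Source.
using Stream utf8View = Encoding.CreateTranscodingStream(
    utf16Source,
    innerStreamEncoding: Encoding.Unicode,
    outerStreamEncoding: Encoding.UTF8,
    leaveOpen: false);

var collected = new MemoryStream();
utf8View.CopyTo(collected);
Console.WriteLine(BitConverter.ToString(collected.ToArray()));
```

The bytes read from utf8View should match what Encoding.UTF8.GetBytes(sample) produces for the whole string.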

The following classes and extension method do the job:

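using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static partial class TextExtensions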
{
    public static Encoding PlatformCompatibleUnicode => BitConverter.IsLittleEndian ? Encoding.Unicode : Encoding.BigEndianUnicode;
    static bool IsPlatformCompatibleUnicode(this Encoding encoding) => BitConverter.IsLittleEndian ? encoding.CodePage == 1200 : encoding.CodePage == 1201;
    
    public static Stream AsStream(this string @string, Encoding encoding = default) => 
        (@string ?? throw new ArgumentNullException(nameof(@string))).AsMemory().AsStream(encoding);
    public static Stream AsStream(this ReadOnlyMemory<char> charBuffer, Encoding encoding = default) =>
        ((encoding ??= Encoding.UTF8).IsPlatformCompatibleUnicode())
            ? new UnicodeStream(charBuffer)
            : Encoding.CreateTranscodingStream(new UnicodeStream(charBuffer), PlatformCompatibleUnicode, encoding, false);
}

sealed class UnicodeStream : Stream
{
    const int BytesPerChar = 2;

    // By sealing UnicodeStream we avoid a lot of the complexity of MemoryStream.
    ReadOnlyMemory<char> charMemory;
    int position = 0;
    Task<int> _cachedResultTask; // For async reads, avoid allocating a Task.FromResult<int>(nRead) every time we read.

    public UnicodeStream(string @string) : this((@string ?? throw new ArgumentNullException(nameof(@string))).AsMemory()) { }
    public UnicodeStream(ReadOnlyMemory<char> charMemory) => this.charMemory = charMemory;

    public override int Read(Span<byte> buffer)
    {
        EnsureOpen();
        var charPosition = position / BytesPerChar;
        // MemoryMarshal.AsBytes will throw on strings longer than int.MaxValue / 2, so only slice what we need. 
        var byteSlice = MemoryMarshal.AsBytes(charMemory.Slice(charPosition, Math.Min(charMemory.Length - charPosition, 1 + buffer.Length / BytesPerChar)).Span);
        var slicePosition = position % BytesPerChar;
        var nRead = Math.Min(buffer.Length, byteSlice.Length - slicePosition);
        byteSlice.Slice(slicePosition, nRead).CopyTo(buffer);
        position += nRead;
        return nRead;
    }

    public override int Read(byte[] buffer, int offset, int count) 
    {
        ValidateBufferArgs(buffer, offset, count);
        return Read(buffer.AsSpan(offset, count));
    }

    public override int ReadByte()
    {
        // Could be optimized.
        Span<byte> span = stackalloc byte[1];
        return Read(span) == 0 ? -1 : span[0];
    }

    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default)
    {
        EnsureOpen();
        if (cancellationToken.IsCancellationRequested) 
            return ValueTask.FromCanceled<int>(cancellationToken);
        try
        {
            return new ValueTask<int>(Read(buffer.Span));
        }
        catch (Exception exception)
        {
            return ValueTask.FromException<int>(exception);
        }   
    }
    
    public override Task<int> ReadAsync(byte[] buffer, int offset, int count, CancellationToken cancellationToken)
    {
        ValidateBufferArgs(buffer, offset, count);
        var valueTask = ReadAsync(buffer.AsMemory(offset, count), cancellationToken);
        if (!valueTask.IsCompletedSuccessfully)
            return valueTask.AsTask();
        var nRead = valueTask.Result; // Consume the ValueTask result exactly once.
        var lastResultTask = _cachedResultTask;
        return (lastResultTask != null && lastResultTask.Result == nRead) ? lastResultTask : (_cachedResultTask = Task.FromResult<int>(nRead));
    }

    void EnsureOpen()
    {
        if (position == -1)
            throw new ObjectDisposedException(GetType().Name);
    }
    
    // https://learn.microsoft.com/en-us/dotnet/api/system.io.stream.flush?view=net-5.0
    // In a class derived from Stream that doesn't support writing, Flush is typically implemented as an empty method to ensure full compatibility with other Stream types since it's valid to flush a read-only stream.
    public override void Flush() { }
    public override Task FlushAsync(CancellationToken cancellationToken) => cancellationToken.IsCancellationRequested ? Task.FromCanceled(cancellationToken) : Task.CompletedTask;
    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => throw new NotSupportedException(); set => throw new NotSupportedException(); }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) =>  throw new NotSupportedException();
    
    protected override void Dispose(bool disposing)
    {
        try 
        {
            if (disposing) 
            {
                _cachedResultTask = null;
                charMemory = default;
                position = -1;
            }
        }
        finally 
        {
            base.Dispose(disposing);
        }
    }   
    
    static void ValidateBufferArgs(byte[] buffer, int offset, int count)
    {
        if (buffer == null)
            throw new ArgumentNullException(nameof(buffer));
        if (offset < 0 || count < 0)
            throw new ArgumentOutOfRangeException();
        if (count > buffer.Length - offset)
            throw new ArgumentException();
    }
}   

Notes:

  • You can stream from a string, a char[] array, or slices thereof by converting them to ReadOnlyMemory<char> buffers. This conversion simply wraps the underlying string or array memory without allocating anything.

  • Solutions that use Encoding.GetBytes() to encode chunks of a string are broken because they will not handle surrogate pairs that are split between chunks. To handle surrogate pairs correctly, Encoding.GetEncoder() must be called once up front to obtain an Encoder. Later, Encoder.GetBytes(ReadOnlySpan<Char>, Span<Byte>, flush: false) can be used to encode in chunks while remembering state between calls.

    (Microsoft's TranscodingStream does this correctly.)

  • You will get the best performance by using Encoding.Unicode, as (on almost all .NET platforms) this encoding is identical to the encoding of the String type itself.

    When a platform-compatible Unicode encoding is supplied no TranscodingStream is used and the returned Stream reads from the character data buffer directly.

  • To do:

    • Test on big-endian platforms (which are rare).
    • Test on strings longer than int.MaxValue / 2.
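
The surrogate-pair note above can be illustrated with a small sketch (assuming .NET Core 2.1 or later and top-level statements). The helper names EncodeChunksNaively and EncodeChunksWithEncoder are hypothetical and exist only for this comparison; the chunk size of 2 is chosen so that the chunk boundary falls in the middle of a surrogate pair.

```csharp
using System;
using System.IO;
using System.Text;

// Encodes each chunk independently, so a surrogate pair that straddles a chunk
// boundary is seen as two lone surrogates.
static byte[] EncodeChunksNaively(string source, Encoding encoding, int chunkChars)
{
    using var output = new MemoryStream();
    for (int i = 0; i < source.Length; i += chunkChars)
    {
        int n = Math.Min(chunkChars, source.Length - i);
        byte[] bytes = encoding.GetBytes(source.Substring(i, n));
        output.Write(bytes, 0, bytes.Length);
    }
    return output.ToArray();
}

// Uses one stateful Encoder for all chunks; with flush: false it carries a
// trailing high surrogate over to the next call.
static byte[] EncodeChunksWithEncoder(string source, Encoding encoding, int chunkChars)
{
    Encoder encoder = encoding.GetEncoder();
    var buffer = new byte[encoding.GetMaxByteCount(chunkChars)];
    using var output = new MemoryStream();
    for (int i = 0; i < source.Length; i += chunkChars)
    {
        int n = Math.Min(chunkChars, source.Length - i);
        bool isLastChunk = i + n >= source.Length; // flush only on the final chunk
        int written = encoder.GetBytes(source.AsSpan(i, n), buffer, flush: isLastChunk);
        output.Write(buffer, 0, written);
    }
    return output.ToArray();
}

string emoji = "a\U0001F600b"; // 'a', one surrogate pair, 'b': four UTF-16 code units
byte[] whole = Encoding.UTF8.GetBytes(emoji);
byte[] naive = EncodeChunksNaively(emoji, Encoding.UTF8, 2);       // splits the pair
byte[] stateful = EncodeChunksWithEncoder(emoji, Encoding.UTF8, 2);

Console.WriteLine(BitConverter.ToString(whole));
Console.WriteLine(BitConverter.ToString(naive));
Console.WriteLine(BitConverter.ToString(stateful));
```

The stateful result should match the whole-string encoding, while the naive result should differ because each half of the split pair is encoded on its own.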

Demo fiddle including some basic tests here.

Upvotes: 4

Peter Ritchie

Reputation: 35881

A Stream can only copy data. In addition, it deals in bytes, not chars, so you'll have to copy data via the encoding process. But if you want to view a string as a stream of ASCII bytes, you could create a class that derives from Stream to do it. For example:

using System;
using System.IO;
using System.Text;

public class ReadOnlyStreamStringWrapper : Stream
{
    private readonly string theString;

    public ReadOnlyStreamStringWrapper(string theString)
    {
        this.theString = theString;
    }

    public override void Flush()
    {
        // Nothing to flush: flushing a read-only stream is valid and does nothing.
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        switch (origin)
        {
            case SeekOrigin.Begin:
                // Seeking to Length (end of stream) is valid, so only offsets beyond it are rejected.
                if (offset < 0 || offset > theString.Length)
                    throw new InvalidOperationException();

                Position = offset;
                break;
            case SeekOrigin.Current:
                if ((Position + offset) < 0)
                    throw new InvalidOperationException();
                if ((Position + offset) > theString.Length)
                    throw new InvalidOperationException();

                Position += offset;
                break;
            case SeekOrigin.End:
                if ((theString.Length + offset) < 0)
                    throw new InvalidOperationException();
                if ((theString.Length + offset) > theString.Length)
                    throw new InvalidOperationException();
                Position = theString.Length + offset;
                break;
        }

        return Position;
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException();
    }

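    public override int Read(byte[] buffer, int offset, int count)
    {
        // Clamp to the characters that remain and advance Position,
        // so that the stream eventually reports its end by returning 0.
        int charCount = (int)Math.Min(count, theString.Length - Position);
        if (charCount <= 0)
            return 0;
        int bytesRead = Encoding.ASCII.GetBytes(theString, (int)Position, charCount, buffer, offset);
        Position += charCount;
        return bytesRead;
    }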

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException();
    }

    public override bool CanRead
    {
        get { return true; }
    }

    public override bool CanSeek
    {
        get { return true; }
    }

    public override bool CanWrite
    {
        get { return false; }
    }

    public override long Length
    {
        get { return theString.Length; }
    }

    public override long Position { get; set; }
}

But that's a lot of work to avoid "copying" data...

Upvotes: -1

C.Evenhuis

Reputation: 26446

You can avoid having to maintain a copy of the whole thing, but you would be forced to use an encoding that produces the same number of bytes for each character. That way you could provide chunks of data via Encoding.GetBytes(str, charIndex, charCount, bytes, byteIndex), written straight into the read buffer as they're being requested.

There will always be one copy action per Stream.Read() operation, because it lets the caller provide the destination buffer.
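
As a rough sketch of that idea, assuming ASCII-only text (one byte per char) and top-level statements, a chunked read could look like the following. ReadAsciiChunk is a hypothetical helper, not part of any library; it mirrors what a Stream.Read override would do.

```csharp
using System;
using System.Text;

// Copies up to `count` ASCII bytes of `source`, starting at `charIndex`, into `buffer`.
// Returns the number of bytes written (0 at the end of the text) and advances `charIndex`.
static int ReadAsciiChunk(string source, ref int charIndex, byte[] buffer, int offset, int count)
{
    int charCount = Math.Min(count, source.Length - charIndex); // one byte per char in ASCII
    if (charCount <= 0)
        return 0;
    int written = Encoding.ASCII.GetBytes(source, charIndex, charCount, buffer, offset);
    charIndex += charCount;
    return written;
}

string message = "Hello, BizTalk!";
var chunk = new byte[4];          // the caller's small read buffer
int position = 0;
var rebuilt = new StringBuilder();
int bytesRead;
while ((bytesRead = ReadAsciiChunk(message, ref position, chunk, 0, chunk.Length)) > 0)
    rebuilt.Append(Encoding.ASCII.GetString(chunk, 0, bytesRead));

Console.WriteLine(rebuilt.ToString() == message);
```

Because the encoding is fixed-width, the byte position in the stream maps directly to a char index in the string, which is what makes chunking this simple.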

Upvotes: 1
