Reputation: 14964
What's the fastest way to turn a string into a byte[] array in C#? I'm sending tons of string data through sockets and need to optimize every single operation. Currently I transform the strings into byte[] arrays before sending using:
private static readonly Encoding encoding = new ASCIIEncoding();
//...
byte[] bytes = encoding.GetBytes(someString);
socket.Send(bytes);
//...
Upvotes: 9
Views: 13781
Reputation: 18749
The performance characteristic of implementing a general-purpose memcpy library function using a SIMD register is significantly more colorful than an equivalent implementation using a general-purpose register...
- Intel 64 and IA-32 Architectures Optimization Reference Manual (April 2018) §3.7.6.1
For screaming speed when converting medium- to larger-sized chunks of data between 8-bit byte[] and "wide" (16-bit, Unicode) text, you'll want to consider solutions which deploy the SIMD instructions PUNPCKLBW+PUNPCKHBW (widening) and PACKUSWB (narrowing). In .NET, these are available as x64 JIT intrinsics, emitted for the hardware-accelerated System.Numerics types Vector and Vector<T> (see here for more info). The generic version Vector<T> is defined in the System.Numerics.Vectors package, which currently remains under fairly active development. As illustrated below, you'll also probably want to include the System.Runtime.CompilerServices.Unsafe package, since it provides the SIMD load/store technique preferred by the Vector<T> authors.
The relevant SIMD acceleration is only enabled for capable CPUs in x64 mode, but otherwise .NET provides transparent fallback to emulation code in the System.Numerics.Vectors library, so the code demonstrated here does indeed reliably function across the wider .NET ecosystem, possibly with reduced performance. To test the code shown below, I used a console app on the full .NET Framework 4.8 ("desktop") in x64 (SIMD) and x86 (emulated) modes.
Since I wouldn't want to deprive anyone of the opportunity to learn the relevant techniques, I'll use Vector.Widen to illustrate the byte[] to char[] direction in C# 7. From this example, doing the reverse--i.e., using Vector.Narrow to implement the narrowing direction--is straightforward and is left as an exercise for the reader (a brief sketch is also included at the end of this answer).
Warning:
The methods suggested here are entirely encoding-unaware: they simply widen raw bytes to 16-bit values, or narrow 16-bit values back to raw bytes, without any regard for character mapping, text encoding, or other linguistic properties. When widening, the surplus bytes are set to zero, and when narrowing, the excess bytes are discarded.
Others have discussed the numerous hazards associated with this practice on this page and elsewhere, so please carefully review and understand the nature of this operation before considering whether it is appropriate for your situation. For clarity, inline validation is elided from the code example shown below, but it could be added to the innermost loop with minimal impact on the SIMD benefit.
You have been warned. Although not SIMD-accelerated, canonical techniques using a suitable Encoding instance are recommended for nearly all realistic app scenarios. The OP specifically requests a maximum-performance solution, but I'll first summarize the proper, sanctioned techniques that should normally be used instead.
To widen a byte array to a .NET String, invoke the GetString() method on a suitable byte-oriented encoding instance:
String Encoding.ASCII.GetString(byte[] bytes)
To narrow a .NET String to an (e.g., ASCII) byte array, invoke the GetBytes() method on a suitable byte-oriented encoding instance:
byte[] Encoding.ASCII.GetBytes(char[] chars)
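For example, a minimal round trip for the OP's ASCII scenario (the variable names here are illustrative):
byte[] wire = Encoding.ASCII.GetBytes(someString);   // narrow: String -> byte[]
String text = Encoding.ASCII.GetString(wire);        // widen:  byte[] -> String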
Ok, now on to the fun part--the extremely fast SIMD-enabled ("vectorized") C# code for "dumb" widening of a byte array. As a reminder, here are some dependencies that should be referenced:
// ...
using System.Numerics; // nuget: System.Numerics.Vectors
using System.Runtime.CompilerServices; // nuget: System.Runtime.CompilerServices.Unsafe
// ...
Here is the public entry point wrapper function. If you prefer a version that returns char[] instead of String, it's provided at the end of this post.
/// <summary>
/// 'Widen' each byte in 'bytes' to 16-bits with no consideration for
/// character mapping or encoding.
/// </summary>
public static unsafe String ByteArrayToString(byte[] bytes)
{
    // note: possible zeroing penalty; consider buffer pooling or
    // other ways to allocate target?
    var s = new String('\0', bytes.Length);
    if (s.Length > 0)
        fixed (char* dst = s)
        fixed (byte* src = bytes)
            widen_bytes_simd(dst, src, s.Length);
    return s;
}
Next is the main working loop body. Notice the prologue loop that aligns the destination to a 16-byte memory boundary, if necessary, by bytewise copying of up to 15 source bytes. This ensures the most efficient operation of the main "quad-quadwise" loop which, with a single pairing of the SIMD PUNPCKLBW/PUNPCKHBW instructions, writes 32 bytes at once (16 source bytes are fetched and then stored as 16 wide chars occupying 32 bytes).
Both the chunked transfer and the pre-alignment to a quad-quad physical boundary (with priority going to the destination address dst over the source src if they aren't co-aligned, as shown here) are official recommendations from the Intel manual cited above. And regardless of alignment, of course, any chunked transfer of 16 bytes at a time can leave zero to 15 residual trailing bytes when the main loop completes; these are finished up by a short epilogue loop.
static unsafe void widen_bytes_simd(char* dst, byte* src, int c)
{
    // prologue: bytewise copy until 'dst' reaches a 16-byte boundary
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (char)*src++;

    // main loop: widen 16 source bytes to 16 wide chars (32 bytes) per iteration
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Vector.Widen(Unsafe.AsRef<Vector<byte>>(src),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 0),
                     out Unsafe.AsRef<Vector<ushort>>(dst + 8));

    // epilogue: finish any residual trailing bytes
    for (c += 0x10; c > 0; c--)
        *dst++ = (char)*src++;
}
That's actually all there is to it! It works like a charm and, as you'll see below, it does 'scream' as advertised.
But first, by turning off the VS2017 debugger option "Disable JIT optimizations," we can examine the native SIMD instruction stream that the x64 JIT generates for the 'release' build on .NET 4.7.2. Here is the relevant part of the main inner loop that blasts through the data 32 bytes at a time. Notice that the JIT has managed to emit the theoretically minimal fetch/store pattern.
L_4223 mov rax,rbx
L_4226 movups xmm0,xmmword ptr [rax] ; fetch 16 bytes
L_4229 mov rax,rdi
L_422C lea rdx,[rdi+10h]
L_4230 movaps xmm2,xmm0
L_4233 pxor xmm1,xmm1
L_4237 punpcklbw xmm2,xmm1 ; interleave 8-to-16 bits (lo)
L_423B movups xmmword ptr [rax],xmm2 ; store 8 bytes (lo) to 8 wide chars (16 bytes)
L_423E pxor xmm1,xmm1
L_4242 punpckhbw xmm0,xmm1 ; interleave 8-to-16 bits (hi)
L_4246 movups xmmword ptr [rdx],xmm0 ; store 8 bytes (hi) to 8 wide chars (16 bytes)
L_4249 add rbx,10h
L_424D add rdi,20h
L_4251 add esi,0FFFFFFF0h
L_4254 test esi,esi
L_4256 jge L_4223
L_4258 ...
Performance test results:
I tested the SIMD code against four other techniques that perform the same function. For the .NET encoders listed below, this was a call to the GetChars(byte[], int, int) method.
The testing included identical work for all units and validation that all units under test produced identical results. Test bytes were random and ASCII-only ([0x01 - 0x7F]) in order to ensure identical results from all test units. Input size was random, with a maximum of 1 MB and a log2 bias towards smaller sizes, such that the average size was about 80 KB.
For fairness, execution order was systematically rotated through the 5 units on each iteration. For warmup, timings were discarded and reset to zero once, at iteration 100. The test harness performs no allocations during the test phase, and a full GC is forced and awaited every 10,000 iterations.
Relative ticks, normalized to best result
.NET Framework 4.7.3056.0 x64 (release)

   iter  |   naive    win-1252     ascii      utf-8       simd
-------- | --------- ----------- ---------- ---------- ----------
  10000  |   131.5      294.5      186.2      145.6      100.0
  20000  |   137.7      305.3      191.9      149.4      100.0
  30000  |   139.2      308.5      195.8      151.5      100.0
  40000  |   141.8      312.1      198.5      153.2      100.0
  50000  |   142.0      313.8      199.1      154.1      100.0
  60000  |   140.5      310.6      196.7      153.0      100.0
  70000  |   141.1      312.9      197.3      153.6      100.0
  80000  |   141.6      313.7      197.8      154.1      100.0
  90000  |   141.3      313.7      197.9      154.3      100.0
 100000  |   141.1      313.3      196.9      153.7      100.0

gcServer=False; LatencyMode.Interactive; Vector.IsHardwareAccelerated=True
On the preferred x64 platform, with JIT optimization enabled and SIMD available, there was no contest: the SIMD code completes in roughly 70% of the time of the next-fastest contender (about 1.4x the throughput). Encoding.Default, which is usually the "Windows-1252" codepage, performed particularly poorly, about 3x slower than the SIMD code.
Earlier I mentioned that the distribution of test data sizes was strongly log-biased towards zero. Without this step--meaning a uniform distribution of sizes from 0 to 1,048,576 bytes (average test size 512K)--SIMD continues to outpace the pack with all other units faring relatively worse vs. the code shown above.
naive       153.45%
win-1252    358.84%
ascii       221.38%
utf-8       161.62%
simd        100.00%
As for the non-SIMD (emulation) case, UTF-8 and SIMD are extremely close--typically within 3-4% of each other--and far better than the rest. I found this result doubly surprising: that the UTF8Encoding source code is so fast (lots of fast-path optimization), and also that the general-purpose SIMD emulation code was able to match that purpose-tuned code.
Finally, as promised earlier, here is the variant of the entry point that returns char[] instead of String:
/// <summary>
/// 'Widen' each byte in 'bytes' to 16-bits with no consideration for
/// character mapping or encoding
/// </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static unsafe char[] WidenByteArray(byte[] bytes)
{
    var rgch = new char[bytes.Length];
    if (rgch.Length > 0)
        fixed (char* dst = rgch)
        fixed (byte* src = bytes)
            widen_bytes_simd(dst, src, rgch.Length);
    return rgch;
}
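And for completeness, here is a minimal sketch of the narrowing direction that was left as an exercise above. It simply mirrors widen_bytes_simd, including its assumption of 16-byte vectors; the name narrow_chars_simd is mine and not part of the original code, and the same encoding-unawareness warning applies (the upper byte of each char is silently discarded):
static unsafe void narrow_chars_simd(byte* dst, char* src, int c)
{
    // prologue: bytewise copy until 'dst' reaches a 16-byte boundary
    for (; c > 0 && ((long)dst & 0xF) != 0; c--)
        *dst++ = (byte)*src++;

    // main loop: narrow 16 wide chars (32 bytes) down to 16 bytes per iteration
    for (; (c -= 0x10) >= 0; src += 0x10, dst += 0x10)
        Unsafe.AsRef<Vector<byte>>(dst) =
            Vector.Narrow(Unsafe.AsRef<Vector<ushort>>(src + 0),
                          Unsafe.AsRef<Vector<ushort>>(src + 8));

    // epilogue: finish any residual trailing chars
    for (c += 0x10; c > 0; c--)
        *dst++ = (byte)*src++;
}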
Upvotes: 6
Reputation: 505
This is a micro-optimization which runs about 10% faster than UTF8.GetBytes().
Here are the benchmark results. I've gone string to char array to bytes, exploiting the fact that a consistent byte representation of a char can, under almost all circumstances, be obtained by interpreting the char as two bytes via a UInt16 pointer. Unfortunately, the reverse (bytes to char array to string) is slower.
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1645 (21H2)
AMD Ryzen Threadripper 3970X, 1 CPU, 64 logical and 32 physical cores
.NET SDK=6.0.202
[Host] : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT [AttachedDebugger]
.NET 6.0 : .NET 6.0.4 (6.0.422.16404), X64 RyuJIT
Job=.NET 6.0 Runtime=.NET 6.0
| Method | numRepititions | TestType | numThreads | stringSize | Mean | Error | StdDev | Gen 0 | Completed Work Items | Lock Contentions | Allocated |
|---------------------- |--------------- |-------------------- |----------- |----------- |---------:|--------:|--------:|------------:|---------------------:|-----------------:|----------:|
| BenchSerialize_String | 10000000 | UTF8.GetBytes | 1 | 64 | 497.9 ms | 5.04 ms | 4.72 ms | 105000.0000 | - | - | 839 MB |
| BenchSerialize_String | 10000000 | StringToByteViaChar | 1 | 64 | 445.5 ms | 3.81 ms | 3.18 ms | 181000.0000 | - | - | 1,450 MB |
| BenchSerialize_String | 10000000 | UTF8.GetString | 1 | 64 | 349.2 ms | 5.15 ms | 5.28 ms | 286000.0000 | - | - | 2,289 MB |
| BenchSerialize_String | 10000000 | ByteToStringViaChar | 1 | 64 | 505.1 ms | 1.82 ms | 1.52 ms | 363000.0000 | - | - | 2,899 MB |
The code is quoted from: https://github.com/tcwicks/ChillX/blob/master/src/ChillX.Serialization/BitConverterExtended.cs
/// <summary>
/// Serializes a char array into a byte array and assigns it to buffer at offset specified by startIndex
/// </summary>
/// <param name="array">array to be serialized</param>
/// <param name="buffer">buffer to write to</param>
/// <param name="startIndex">buffer offset</param>
/// <returns>number of bytes written</returns>
/// <exception cref="ArgumentNullException">Buffer cannot be null</exception>
/// <exception cref="ArgumentOutOfRangeException">start index and length of data must fit in range of buffer byte array</exception>
public unsafe static int GetBytes(char[] array, byte[] buffer, int startIndex)
{
    if (buffer == null)
    {
        throw new ArgumentNullException("value cannot be null");
    }
    if (startIndex < 0)
    {
        throw new ArgumentOutOfRangeException(@"startIndex must be >= 0");
    }
    if (buffer.Length < (startIndex + (array == null ? 0 : array.Length * 2)))
    {
        throw new ArgumentOutOfRangeException(@"startIndex + length is out of range");
    }
    Contract.EndContractBlock();
    if (array == null || array.Length == 0) { return 0; }  // nothing to copy (also avoids &array[0] on an empty array)
    int len = array.Length;
    fixed (char* pArr = &array[0])
    {
        fixed (byte* b = &buffer[startIndex])
        {
            // reinterpret each 16-bit char as a 16-bit integer and copy it verbatim
            for (int I = 0; I < len; I++)
            {
                *((short*)b + I) = *((short*)pArr + I);
            }
        }
    }
    return len * 2;
}
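A hypothetical usage in the OP's socket scenario might look like the following; the class name BitConverterExtended comes from the linked file, and everything else here is illustrative:
char[] chars = someString.ToCharArray();
byte[] sendBuffer = new byte[chars.Length * 2];                     // 2 bytes per char
int written = BitConverterExtended.GetBytes(chars, sendBuffer, 0);
socket.Send(sendBuffer, written, SocketFlags.None);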
Upvotes: 0
Reputation:
Just another tip: I don't know how you create your initial strings, but remember that StringBuilder.Append("something") is much faster than something like myString += "something".
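For example, with a hypothetical parts array (StringBuilder lives in System.Text):
var sb = new StringBuilder();
foreach (string part in parts)
    sb.Append(part);               // no intermediate string allocations
string message = sb.ToString();    // vs. message += part on every iteration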
In the whole process of creating the strings and sending them through a socket connection, I would be surprised if the bottleneck was the conversion of strings into byte arrays. But I'd be very interested if someone tested this with a profiler.
Upvotes: 0
Reputation: 13055
I'd suggest profiling what you're doing. I find it doubtful that the speed of converting a string to a byte array is a larger problem in throughput than the speed of the socket itself.
Upvotes: -1
Reputation: 8531
With no clue to your concurrency requirements (or anything else): Can you spawn some threads on the ThreadPool that convert the strings to byte arrays and drop them into a Queue, and have one more thread watching the Queue and sending the data?
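A minimal sketch of that shape, using a BlockingCollection (System.Collections.Concurrent) as the queue; the PendingStrings() source and the capacity are purely illustrative:
var queue = new BlockingCollection<byte[]>(boundedCapacity: 1024);

// converter: encode on the ThreadPool and enqueue
ThreadPool.QueueUserWorkItem(_ =>
{
    foreach (string s in PendingStrings())     // hypothetical source of outgoing strings
        queue.Add(Encoding.ASCII.GetBytes(s));
    queue.CompleteAdding();                    // tell the sender nothing more is coming
});

// dedicated sender: watch the queue and push the data out the socket
var sender = new Thread(() =>
{
    foreach (byte[] bytes in queue.GetConsumingEnumerable())
        socket.Send(bytes);
});
sender.Start();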
Upvotes: 1
Reputation: 22220
I imagine the GetBytes() function is already well optimized for this. I can't think of any suggestions to improve the speed of your existing code.
EDIT -- You know, I don't know if this is faster or not. But here's another method using the BinaryFormatter:
BinaryFormatter bf = new BinaryFormatter();
MemoryStream ms = new MemoryStream();
bf.Serialize(ms, someString);
byte[] bytes = ms.ToArray();
ms.Close();
socket.Send(bytes);
The reason I think this might be faster is that it skips the encoding step. I'm also not entirely sure this will work properly. But you might try it and see. Of course, if you need the ASCII encoding then this won't help.
I just had another thought. I believe this code would return double the number of bytes compared to GetBytes with ASCII encoding. The reason is that all strings in .NET use Unicode behind the scenes, and Unicode uses 2 bytes per character whereas ASCII uses just 1. So the BinaryFormatter is probably not the thing to use in this case, because you'd be doubling the amount of data you're sending over the socket.
Upvotes: 1
Reputation: 1500435
If all your data is really going to be ASCII, then you may be able to do it slightly faster than ASCIIEncoding, which has various (entirely reasonable) bits of error handling etc. You may also be able to speed it up by avoiding creating new byte arrays all the time. Assuming you have an upper bound which all your messages will be under:
void QuickAndDirtyAsciiEncode(string chars, byte[] buffer)
{
    int length = chars.Length;
    for (int i = 0; i < length; i++)
    {
        buffer[i] = (byte) (chars[i] & 0x7f);
    }
}
You'd then do something like:
readonly byte[] Buffer = new byte[8192]; // Reuse this repeatedly
...
QuickAndDirtyAsciiEncode(text, Buffer);
// We know ASCII takes one byte per character
socket.Send(Buffer, text.Length, SocketFlags.None);
This is pretty desperate optimisation though. I'd stick with ASCIIEncoding until I'd proven that this was the bottleneck (or at least that this sort of grotty hack doesn't help).
Upvotes: 16
Reputation: 292415
As others have said, the Encoding class is already optimized for that task, so it will probably be hard to make it faster. There's one micro-optimization that you could do: use Encoding.ASCII rather than new ASCIIEncoding(). But as everyone knows, micro-optimizations are bad ;)
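In the OP's snippet that would simply be:
byte[] bytes = Encoding.ASCII.GetBytes(someString);  // reuses the framework's cached ASCII encoder
socket.Send(bytes);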
Upvotes: 0
Reputation: 10776
What are you trying to optimize for? CPU? Bandwidth?
If you're trying to optimize bandwidth, you could try compressing the string data beforehand.
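For example, a minimal sketch of the compression idea using GZipStream (from System.IO.Compression); whether it actually saves bandwidth depends on how compressible your strings are:
byte[] raw = Encoding.ASCII.GetBytes(someString);
using (var ms = new MemoryStream())
{
    using (var gz = new GZipStream(ms, CompressionMode.Compress))
        gz.Write(raw, 0, raw.Length);          // disposing flushes the gzip trailer
    socket.Send(ms.ToArray());                 // the receiver must decompress, of course
}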
First, profile your code, figure out what the slow bits are, before you try to optimize at such a low level.
Upvotes: 1
Reputation: 32960
I would say that how you are doing it now is plenty good. If you are really concerned with very low-level optimization like that, the best recommendation I can make is to get Reflector. With Reflector, you can look at the code yourself (most of the time) and see what the algorithms are. If Reflector does not show it, you could always download Microsoft's SSCLI (Shared Source Common Language Infrastructure) to see the C++ code behind MethodImplOptions.InternalCall methods.
For reference, here is the actual implementation of Encoding.ASCII.GetBytes:
public override int GetBytes(string chars, int charIndex, int charCount, byte[] bytes, int byteIndex)
{
    if ((chars == null) || (bytes == null))
    {
        throw new ArgumentNullException();
    }
    if ((charIndex < 0) || (charCount < 0))
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((chars.Length - charIndex) < charCount)
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((byteIndex < 0) || (byteIndex > bytes.Length))
    {
        throw new ArgumentOutOfRangeException();
    }
    if ((bytes.Length - byteIndex) < charCount)
    {
        throw new ArgumentException();
    }
    int num = charIndex + charCount;
    while (charIndex < num)
    {
        char ch = chars[charIndex++];
        if (ch >= '\x0080')
        {
            ch = '?';
        }
        bytes[byteIndex++] = (byte) ch;
    }
    return charCount;
}
Upvotes: 9