Reputation: 1886
I wrote the following to convert a byte array data
to a string array hex,
containing 32 bytes as a hex string per entry, to write them to a file.
byte[] data = new byte[4*1024*1024];
string[] hex = data.Select((b) => b.ToString("X2")).ToArray();
hex = Enumerable.Range(0, data.Length / 32).Select((r) => String.Join(" ", hex.Skip(r * 32).Take(32))).ToArray(); // <= This line takes forever
The problem was that it took minutes(!) to finish although the resulting file was less than 20MB. So I tried to optimize it and came up with the following:
byte[] data = new byte[4*1024*1024];
string[] hex = new string[4*1024*1024/32];
for (var i = 0; i <= hex.Length - 1; i++)
{
    var sb = new System.Text.StringBuilder();
    sb.Append(data[i * 32].ToString("X2"));
    for (var k = 1; k <= 32 - 1; k++)
    {
        sb.Append(' ');
        sb.Append(data[i * 32 + k].ToString("X2"));
    }
    hex[i] = sb.ToString();
}
This version does the same but is several orders of magnitude faster (133 ms vs 8 minutes).
My problem is that I don't really understand why the original version is so slow. I looked at the source of String.Join()
and it looks pretty similar to my improved version.
I like to use LINQ for these kinds of things, because you can solve all sorts of problems pretty easily, and I thought it was efficient in most cases because of its lazy evaluation. So I would like to know what I am missing here, to improve my future usage of LINQ.
As a side note, I know that it could probably be written even faster, but that is really not the point here, because the second version is fast enough for a function only used for debugging purposes.
Upvotes: 4
Views: 359
Reputation: 416131
My problem is that I don't really understand why the original version is so slow.
It's this part:
hex.Skip(r * 32)
.Skip() has to walk the sequence. It doesn't get to jump straight to the correct index. In other words, for every 32 bytes in the array, you re-walk the entire array from the beginning until you get to the start of the current chunk. It's a Shlemiel the Painter situation.
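The quadratic cost is easy to demonstrate. Here's a minimal sketch (the SkipCost and Counted names are mine, not from the question): wrap the source in an iterator that counts every element walked, then chunk it with Skip/Take the way the original code does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SkipCost
{
    // Walks n items in 32-element chunks via Skip/Take and returns how
    // many elements were actually enumerated along the way.
    public static long CountMoves(int n)
    {
        long moves = 0;
        IEnumerable<int> Counted()
        {
            for (int i = 0; i < n; i++) { moves++; yield return i; }
        }

        foreach (var r in Enumerable.Range(0, n / 32))
        {
            // Skip(r * 32) re-walks everything before the chunk each time.
            foreach (var _ in Counted().Skip(r * 32).Take(32)) { }
        }
        return moves;
    }

    static void Main()
    {
        // For 4096 items this visits roughly n^2 / 64 elements
        // (on the order of 260,000), not 4096.
        Console.WriteLine(SkipCost.CountMoves(4096));
    }
}
```

Scale that up to 4 MB of data and the re-walking dominates everything else.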
You could potentially also make the original code faster by using an ArraySegment type, Array.Copy(), or Span<string>. You could also write your own LINQ-like Chunk() operator to return 32-byte sequences from the original IEnumerable, or use this very simple Segment() method:
public static IEnumerable<T> Segment<T>(this T[] original, int start, int length)
{
    int end = start + length;
    while (start < end)
        yield return original[start++];
}
which would change the original code to look like this:
byte[] data = new byte[4*1024*1024];
string[] hex = data.Select((b) => b.ToString("X2")).ToArray();
hex = Enumerable.Range(0, data.Length / 32).Select((r) => String.Join(" ", hex.Segment(r * 32, 32))).ToArray();
and, for fun, using the Chunk()
implementation I linked earlier:
byte[] data = new byte[4*1024*1024];
var hex = data.Select(b => b.ToString("X2"))
.Chunk(32)
.Select(c => string.Join(" ", c))
.ToArray(); //only call ToArray() if you *really* need the array. Often the enumerable is enough.
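Enumerable.Chunk() is built into .NET 6 and later; on older targets, a minimal stand-in (my sketch, not the implementation linked above) could look like this:

```csharp
using System;
using System.Collections.Generic;

public static class ChunkExtensions
{
    // Buffers the source into arrays of up to `size` elements; the last
    // chunk may be shorter. A single forward pass, no repeated re-walking.
    public static IEnumerable<T[]> Chunk<T>(this IEnumerable<T> source, int size)
    {
        if (size < 1) throw new ArgumentOutOfRangeException(nameof(size));
        var buffer = new List<T>(size);
        foreach (var item in source)
        {
            buffer.Add(item);
            if (buffer.Count == size)
            {
                yield return buffer.ToArray();
                buffer.Clear();
            }
        }
        if (buffer.Count > 0)
            yield return buffer.ToArray();
    }
}
```

Note that on .NET 6+ this would collide with the built-in System.Linq.Enumerable.Chunk, so only add it when targeting older frameworks.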
Another fun option, using String.Create():
byte[] data = new byte[4*1024*1024];
char[] hexChars = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
                    'A', 'B', 'C', 'D', 'E', 'F' };
var hex = data.Chunk(32)
    .Select(c => string.Create(95, c, (r, d) => {
        int i = 0;
        foreach (byte b in d)
        {
            r[i*3] = hexChars[(b & 0xf0) >> 4];
            r[(i*3) + 1] = hexChars[b & 0x0f];
            if (i*3 < 92) r[(i*3) + 2] = ' ';
            i++;
        }
    }))
    .ToArray();
You should also look at this BitConverter.ToString()
overload.
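For reference, that overload takes a start index and a length and produces hyphen-separated hex, so each 32-byte line only needs a Replace to match the format above:

```csharp
using System;

class BitConverterDemo
{
    static void Main()
    {
        byte[] data = { 0x00, 0xAB, 0xFF, 0x10 };
        // BitConverter.ToString(array, startIndex, length) yields "00-AB-FF-10";
        // swapping '-' for ' ' matches the question's output format.
        string line = BitConverter.ToString(data, 0, data.Length).Replace('-', ' ');
        Console.WriteLine(line); // 00 AB FF 10
    }
}
```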
I'd love to see how each of these benchmark.
Upvotes: 5
Reputation: 43886
The Take
implementation of .NET Framework does not include any optimization for sources of type IList<T>
, so it becomes very slow when called repeatedly on large lists or arrays. The corresponding implementation in .NET Core includes these optimizations, so it performs quite decently (on a par with a manually coded loop).
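If you're stuck on .NET Framework, you can keep the LINQ shape and sidestep the unoptimized Skip/Take by handing String.Join an ArraySegment<string>, which is an O(1) view over the array (a sketch of the idea, not code from the question):

```csharp
using System;
using System.Linq;

class SegmentJoin
{
    static void Main()
    {
        byte[] data = new byte[4 * 1024 * 1024];
        string[] hexBytes = data.Select(b => b.ToString("X2")).ToArray();

        // ArraySegment<T> implements IEnumerable<T> (since .NET 4.5) and is
        // O(1) to create, so no chunk re-walks the array from the start.
        string[] hex = Enumerable.Range(0, data.Length / 32)
            .Select(r => string.Join(" ", new ArraySegment<string>(hexBytes, r * 32, 32)))
            .ToArray();

        Console.WriteLine(hex.Length);    // one line per 32 bytes
        Console.WriteLine(hex[0].Length); // 32 * 2 hex chars + 31 spaces = 95
    }
}
```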
Upvotes: 2