Reputation: 23
I've written code to count the frequency of each byte in a binary file, using LINQ. The code seems slow when evaluating the LINQ expression, and it seems hard to apply parallelism to this kind of logic. Building the frequency table over a ~475MB file takes approximately 1 minute.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        // File size: 478.668 KB
        byte[] ltext = File.ReadAllBytes(@"D:\Setup.exe");
        sw.Stop();
        Console.WriteLine("Reading File {0}", GetTime(sw));

        sw.Restart(); // Restart (not Start) so the read time isn't counted twice
        Dictionary<byte, int> result = (from i in ltext
                                        group i by i into g
                                        orderby g.Count() descending
                                        select new { Key = g.Key, Freq = g.Count() })
                                       .ToDictionary(x => x.Key, x => x.Freq);
        sw.Stop();
        Console.WriteLine("Generating Freq Table {0}", GetTime(sw));

        foreach (var i in result)
        {
            Console.WriteLine(i);
        }
        Console.WriteLine(result.Count);
        Console.ReadLine();
    }

    static string GetTime(Stopwatch sw)
    {
        TimeSpan ts = sw.Elapsed;
        string elapsedTime = String.Format("{0} min {1} sec {2} ms", ts.Minutes, ts.Seconds, ts.Milliseconds);
        return elapsedTime;
    }
}
I've tried a non-LINQ solution using a few loops, and the performance is about the same. Please, any advice on how to optimize this? Sorry for my bad English.
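For illustration, the loop version was along these lines (a hypothetical sketch, not the exact code): every byte still pays a Dictionary hash lookup, which is presumably why it is no faster than the LINQ query.

// Hypothetical sketch of the non-LINQ attempt: a plain loop over a
// Dictionary. Each byte still costs a hash + lookup, so this performs
// about the same as the LINQ query above.
Dictionary<byte, int> freq = new Dictionary<byte, int>();
foreach (byte b in ltext)
{
    int count;
    freq.TryGetValue(b, out count); // count is left at 0 for a new key
    freq[b] = count + 1;
}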
Upvotes: 2
Views: 1083
Reputation: 942508
This took a bit over a second on a 442MB file on my otherwise poky Dell laptop:
byte[] ltext = File.ReadAllBytes(@"c:\temp\bigfile.bin");
var freq = new long[256];
var sw = Stopwatch.StartNew();
foreach (byte b in ltext) {
    freq[b]++;
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Very hard to beat the raw perf of an array.
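If you still want the parallelism mentioned in the question, a sketch along these lines (assuming .NET 4's Parallel.ForEach with a range Partitioner; file path as above) keeps the array approach by giving each thread its own local histogram and merging at the end:

using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        byte[] ltext = File.ReadAllBytes(@"c:\temp\bigfile.bin");
        var freq = new long[256];
        var sw = Stopwatch.StartNew();
        // Each partition counts into its own local array; merging is
        // only 256 additions per thread, done under a lock.
        Parallel.ForEach(
            Partitioner.Create(0, ltext.Length),
            () => new long[256],                    // thread-local histogram
            (range, state, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    local[ltext[i]]++;
                return local;
            },
            local =>
            {
                lock (freq)
                    for (int i = 0; i < 256; i++)
                        freq[i] += local[i];
            });
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}

Whether this actually beats the single loop depends on your hardware; the plain array loop is often already limited by memory bandwidth rather than CPU.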
Upvotes: 2
Reputation: 12473
The following displays the frequency of bytes in descending order in a 465MB file on my machine in under 9 seconds when built in release mode.
Note that I've made it faster by reading the file in 100,000-byte blocks (you can experiment with this; 16K blocks made no appreciable difference on my machine). The point is that the tight inner loop consumes bytes from an in-memory buffer: calling Stream.ReadByte() is fast, but not nearly as fast as indexing into a byte array.
Also, reading the whole file into memory exerts extreme memory pressure, which will hamper performance and fail completely if the file is large enough.
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main( string[] args )
    {
        Console.WriteLine( "Reading file..." );
        var sw = Stopwatch.StartNew();
        var frequency = new long[ 256 ];
        using ( var input = File.OpenRead( @"c:\Temp\TestFile.dat" ) )
        {
            var buffer = new byte[ 100000 ];
            int bytesRead;
            do
            {
                bytesRead = input.Read( buffer, 0, buffer.Length );
                for ( var i = 0; i < bytesRead; i++ )
                    frequency[ buffer[ i ] ]++;
            } while ( bytesRead == buffer.Length ); // stop on a short read (end of file)
        }
        Console.WriteLine( "Read file in " + sw.ElapsedMilliseconds + "ms" );

        // Pair each count with its byte value, then sort by frequency.
        var result = frequency.Select( ( f, i ) => new ByteFrequency { Byte = i, Frequency = f } )
                              .OrderByDescending( x => x.Frequency );
        foreach ( var byteCount in result )
            Console.WriteLine( byteCount.Byte + " " + byteCount.Frequency );
    }

    public class ByteFrequency
    {
        public int Byte { get; set; }
        public long Frequency { get; set; }
    }
}
Upvotes: 2
Reputation: 35594
Why not just
int[] freq = new int[256];
foreach (byte b in ltext)
    freq[b]++;
?
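If you then need the frequency-ordered dictionary from the question, the 256-entry array converts cheaply afterwards (a sketch using the question's own LINQ operators):

// Sketch: turn the counts array back into the question's
// frequency-ordered dictionary; only 256 entries, so this is cheap.
Dictionary<byte, int> result = freq
    .Select((count, value) => new { Value = (byte)value, Count = count })
    .OrderByDescending(x => x.Count)
    .ToDictionary(x => x.Value, x => x.Count);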
Upvotes: 1