raziel
raziel

Reputation: 23

Linq Optimization for Count And Group By

i've written written a code for counting each byte frequency in binary file. Using Linq. Code seem to slow when performing the Linq expression. Its seem hard to implement Parallelism on this kind of logic. To build the freq table over 475MB it took approx 1 mins.

class Program
{
    static void Main(string[] args)
    {
        Dictionary<byte, int> freq = new Dictionary<byte, int>();
        Stopwatch sw = new Stopwatch();


        sw.Start();
        //File Size 478.668 KB
        byte[] ltext = File.ReadAllBytes(@"D:\Setup.exe");
        sw.Stop();

        Console.WriteLine("Reading File {0}", GetTime(sw));




        sw.Start();
        Dictionary<byte, int> result = (from i in ltext
                                     group i by i into g
                                     orderby g.Count() descending
                                     select new { Key = g.Key, Freq = g.Count() })
                                    .ToDictionary(x => x.Key, x => x.Freq);
        sw.Stop();
        Console.WriteLine("Generating Freq Table {0}", GetTime(sw));


        foreach (var i in result)
        {
            Console.WriteLine(i);
        }
        Console.WriteLine(result.Count);
        Console.ReadLine();
    }

    static string GetTime(Stopwatch sw)
    {
        TimeSpan ts = sw.Elapsed;
        string elapsedTime = String.Format("{0} min {1} sec {2} ms",ts.Minutes, ts.Seconds, ts.Milliseconds);
        return elapsedTime;
    }

I've tried to implement non linq solution using few loops, the performance its about the same. Please, any advice to optimize this. Sorry For my bad English

Upvotes: 2

Views: 1083

Answers (3)

Hans Passant
Hans Passant

Reputation: 942508

This took a bit over a second on a 442MB file on my otherwise poky Dell laptop:

        byte[] ltext = File.ReadAllBytes(@"c:\temp\bigfile.bin");
        var freq = new long[256];
        var sw = Stopwatch.StartNew();
        foreach (byte b in ltext) {
            freq[b]++;
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);

Very hard to beat the raw perf of an array.

Upvotes: 2

Mike Scott
Mike Scott

Reputation: 12473

The following displays the frequency of bytes in descending order in a 465MB file on my machine in under 9 seconds when build in release mode.

Note, I've made it faster by reading the file in 100000 byte blocks (you can experiment with this - 16K blocks made no appreciable difference on my machine). The point is that the inner loop is the one supplying bytes. Calling Stream.ReadByte() is fast but not nearly as fast as indexing a byte in an array.

Also, reading the whole file into memory exerts extreme memory pressure which will hamper performance and will fail completely if the file is large enough.

using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

class Program
{
    static void Main( string[] args )
    {
        Console.WriteLine( "Reading file..." );
        var sw = Stopwatch.StartNew();
        var frequency = new long[ 256 ];
        using ( var input = File.OpenRead( @"c:\Temp\TestFile.dat" ) )
        {
            var buffer = new byte[ 100000 ];
            int bytesRead;
            do
            {
                bytesRead = input.Read( buffer, 0, buffer.Length );
                for ( var i = 0; i < bytesRead; i++ )
                    frequency[ buffer[ i ] ]++;
            } while ( bytesRead == buffer.Length );
        }
        Console.WriteLine( "Read file in " + sw.ElapsedMilliseconds + "ms" );

        var result = frequency.Select( ( f, i ) => new ByteFrequency { Byte = i, Frequency = f } )
            .OrderByDescending( x => x.Frequency );
        foreach ( var byteCount in result )
            Console.WriteLine( byteCount.Byte + " " + byteCount.Frequency );
    }

    public class ByteFrequency
    {
        public int Byte { get; set; }
        public long Frequency { get; set; }
    }
}

Upvotes: 2

Vlad
Vlad

Reputation: 35594

Why not just

int[] freq = new int[256];
foreach (byte b in ltext)
    freq[b]++;

?

Upvotes: 1

Related Questions