Girish Sakhare
Girish Sakhare

Reputation: 763

Performance improvement for linq query with distinct

Considering the sample table

Col 1, Col2, Col3
1    , x   , G
1    , y   , H
2    , z   , J
2    , a   , K
2    , a   , K
3    , b   , E

I want below result, i.e distinct rows

1    , x   , G
1    , y   , H
2    , z   , J
2    , a   , K
3    , b   , E

I tried

var Result = Context.Table.Select(C => 
                 new { 
                       Col1 = C.Col1,
                       Col2 = C.Col2,
                       Col3 = C.Col3 
                      }).Distinct();

and

Context.Table.GroupBy(x=>new {x.Col1,x.Col2,x.Col3}).Select(x=>x.First()).ToList();

The results are as expected, however my table has 35 columns and 1 million records and its size will keep on growing, the current time for the query is 22-30 secs, so how to improve the performance and get it down to 2-3 secs?

Upvotes: 2

Views: 1392

Answers (1)

Gilad Green
Gilad Green

Reputation: 37299

Using distinct is the way to go... I'd say that the first approach you tried is the correct one - but do you really need all 1 million rows? See what where conditions you can add or maybe take just the first x records?

var Result = Context.Table.Select(c => new 
    { 
        Col1 = c.Col1,
        Col2 = c.Col2,
        Col3 = c.Col3 
    })
    .Where(c => /*some condition to narrow results*/)
    .Take(1000) //some number of the wanted amount of records
    .Distinct();

What you might be able to do, is to use the rownum to select in bulks. Something like:

public <return type> RetrieveBulk(int fromRow, int toRow)
{
    return Context.Table.Where(record => record.Rownum >= fromRow && record.Rownum < toRow)
        .Select(c => new 
        { 
            Col1 = c.Col1,
            Col2 = c.Col2,
            Col3 = c.Col3 
        }).Distinct();
}

This code you can then do something like:

List<Task<return type>> selectTasks = new List<Task<return type>>();
for(int i = 0; i < 1000000; i+=1000)
{
    selectTasks.Add(Task.Run(() => RetrieveBulk(i, i + 1000)));
}

Task.WaitAll(selectTasks);

//And then intercet data using some efficient structure as a HashSet so when you intersect it wont be o(n)2 but o(n)

Upvotes: 1

Related Questions