Rob

Reputation: 11788

Most performant way to filter a huge List in C#?

We have an ASP.NET MVC web application that is hooked up to a SQL Server DB via Entity Framework. One of the main tasks of this application is to allow users to quickly search and filter a huge database table that holds archive values.

The table structure is quite simple: Timestamp (DateTime), StationId (int), DatapointId (int), Value (double). This table holds somewhere between 10 and 100 million rows. I optimized the DB table with a covering index etc., but the user experience was still quite laggy when filtering by DatapointId, StationId and time, and then skipping and taking only the portion I want to show on the page.

So I tried a different approach: as our server has a lot of RAM, I thought that we could simply load the whole archive table into a List<ArchiveRow> when the web app starts up, and then get the data directly from this List instead of doing a round-trip to the database. This works quite well; it takes about 9 seconds to load the whole archive table (with currently about 10 million entries) into the List. The response and row classes are simple objects that look like this:

public class ArchiveResponse {
    public int Length { get; set; }
    public int numShown { get; set; }
    public int numFound { get; set; }
    public int numTotal { get; set; }
    public List<ArchiveRow> Rows { get; set; }
}

and accordingly:

public class ArchiveRow {
    public int s { get; set; }
    public int d { get; set; }
    public DateTime t { get; set; }
    public double v { get; set; }
}
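For completeness, the cache is loaded at startup roughly like this (a simplified sketch; ArchiveDbContext and ArchiveValues are placeholder names). AsNoTracking is used because the data is only read:

using (var db = new ArchiveDbContext())
{
    ArchiveCache = db.ArchiveValues
        .AsNoTracking()
        .Select(r => new ArchiveCacheValue
        {
            StationId = r.StationId,
            DataPointId = r.DatapointId,
            DateValue = r.Timestamp,
            Value = r.Value
        })
        .ToList();
}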

When I now try to get the desired data from the List with a LINQ query, it is already faster than querying the DB, but it is still quite slow when filtering by multiple criteria. For example, when I filter by one StationId and 12 DatapointIds, it takes about 5 seconds to retrieve a window of 25 rows. I already switched from filtering with Where to using joins, but I think there is still room for improvement. Are there maybe better ways to implement such a caching mechanism while keeping memory consumption as low as possible? Are there other collection types that are better suited for this purpose?

So here is the code that filters and fetches the relevant data from the ArchiveCache list:

// Total number of entries in archive cache
var numTotal = ArchiveCache.Count();

// Initial Linq query
ParallelQuery<ArchiveCacheValue> query = ArchiveCache.AsParallel();

// The request may contain StationIds that the user is interested in,
// so here's the filtering by StationIds with a join:
if (request.StationIds.Count > 0)
{
    query = from a in ArchiveCache.AsParallel()
             join b in request.StationIds.AsParallel()
             on a.StationId equals b
             select a;
}

// The request may contain DatapointIds that the user is interested in,
// so here's the filtering by DatapointIds with a join:
if (request.DatapointIds.Count > 0)
{
    query = from a in query.AsParallel()
             join b in request.DatapointIds.AsParallel()
             on a.DataPointId equals b
             select a;
}

// Number of matching entries after filtering and before windowing
int numFound = query.Count();

// Pagination: Select only the current window that needs to be shown on the page
var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// Number of entries on the current page that will be shown
int numShown = result.Count();

// Build a response object, serialize it to Json and return to client
// Note: The projection with the Rows is not a bottleneck, it is only done to
// shorten 'StationId' to 's' etc. At this point there are only 25 to 50 rows,
// so that is no problem and happens in way less than 1 ms
ArchiveResponse myResponse = new ArchiveResponse();
myResponse.Length = request.Length;
myResponse.numShown = numShown;
myResponse.numFound = numFound;
myResponse.numTotal = numTotal;
myResponse.Rows = result.Select(x => new ArchiveRow() { s = x.StationId, d = x.DataPointId, t = x.DateValue, v = x.Value }).ToList();

return JsonSerializer.ToJsonString(myResponse);

Some more details: the number of stations is usually between 5 and 50, rarely more than 50. The number of datapoints is < 7000. The web application is set to 64 bit with <gcAllowVeryLargeObjects enabled="true" /> set in the web.config.

I'm really looking forward to further improvements and recommendations. Maybe there's a completely different approach based on arrays or similar that performs way better without LINQ?

Upvotes: 7

Views: 7193

Answers (4)

Danny_ds

Reputation: 11406

I would at least store those values in an array or vector of struct ArchiveRow to make sure all data is in contiguous memory. This way you will greatly benefit from locality of reference (i.e. use the L1 cache efficiently), and you also avoid the overhead of a list (pointers/references). (Update: I did a quick lookup of C# List. It looks like List is the same as a C++ vector (i.e. an array), and C# LinkedList is the same as a C++ list, which is a bit confusing. In other words, there is no pointer overhead for the C# List<> itself, as long as the element type is a struct.)

Try to make the struct as small as possible. Use uint or even ushort where possible, maybe a 32-bit value for the datetime, and maybe even float instead of double (depending on your values). Also put the widest fields first in your struct (for better alignment).
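For example, a layout along these lines (just a sketch; whether the narrower types fit depends on your actual value ranges):

public struct ArchiveEntry
{
    public uint Minutes;       // e.g. minutes since a fixed epoch instead of a full DateTime
    public float Value;        // 4 bytes instead of 8, if the precision is acceptable
    public ushort StationId;   // fits as long as station ids stay below 65536
    public ushort DatapointId; // < 7000 datapoints, so 16 bits are plenty
}

That is 12 bytes per entry instead of 24, i.e. roughly 120 MB instead of 240 MB for 10 million rows.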

Even the brute force approach (scanning the whole array, a few hundred MB of contiguous memory) should finish in well under 1 second.

To optimize further you can sort the data (or better, retrieve it sorted from the database) which will allow you to do a binary search, and also keep the values for the resultset close together. For example: sort on StationId, DataPointId.
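With the array sorted on (StationId, DatapointId), a classic lower-bound binary search finds the start of a station's block; a sketch:

// Returns the index of the first entry whose StationId is >= stationId
// (data.Length if there is none). data must be sorted by StationId.
static int LowerBoundByStation(ArchiveEntry[] data, ushort stationId)
{
    int lo = 0, hi = data.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (data[mid].StationId < stationId) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

From that index you scan forward only as long as the StationId still matches.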

If the data gets bigger, you could store the same structure on disk (in a binary file) and access it via memory mapping.
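A rough sketch of that idea (the file name and record layout are assumptions): write the entries as fixed-size records and read them back through a memory-mapped view, so the OS pages the data in on demand:

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

int entrySize = Marshal.SizeOf<ArchiveEntry>();
using (var mmf = MemoryMappedFile.CreateFromFile("archive.bin", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    // Read record number i without loading the whole file into managed memory.
    long i = 12345;
    accessor.Read(i * entrySize, out ArchiveEntry row);
}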

Also, you only have to scan until you have found the first 25 items. You could then store the last checked position (for that session / query), and when the next page is asked for, start from there for the next 25 items. This will also save the memory that would otherwise be needed to store the full result set.

If the number of stations is small, and the data is sorted on StationId, you could also keep a small list or jump table (while importing) with the starting position of each StationId and directly jump to there to scan for datapoints of that station.
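For instance, with the array sorted on StationId, a small dictionary built once during the import gives the start index of each station's block (sketch):

// Maps each StationId to the index of its first row in the sorted array.
var firstIndexByStation = new Dictionary<ushort, int>();
for (int i = 0; i < data.Length; i++)
{
    if (!firstIndexByStation.ContainsKey(data[i].StationId))
        firstIndexByStation[data[i].StationId] = i;
}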


"it takes about 9 seconds to load the whole archive table (with currently about 10 million entries) into the List."

Just in case you didn't do this already, make sure to set an initial Capacity on the list to avoid multiple reallocations.
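Something like this (sketch; the row count is a placeholder and could come from a quick SELECT COUNT(*) before the load):

int expectedRows = 10_000_000; // known or queried up front
var archiveCache = new List<ArchiveCacheValue>(expectedRows); // backing array allocated once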

Upvotes: 3

Evk

Reputation: 101493

You can adjust your storage to this specific query type. First, create dictionaries from your in-memory archive:

ArchiveCacheByDatapoint = ArchiveCache.GroupBy(c => c.DataPointId)
            .ToDictionary(c => c.Key, c => c.ToList());
ArchiveCacheByStation = ArchiveCache.GroupBy(c => c.StationId)
            .ToDictionary(c => c.Key, c => c.ToList());

Then use those dictionaries in your query:

bool hasStations = request.StationIds.Length > 0;
bool hasDatapoints = request.DatapointIds.Length > 0;            
int numFound = 0;
List<ArchiveCacheValue> result;
if (hasDatapoints && hasStations) {
    // special case - filter by both
    result = new List<ArchiveCacheValue>();
    // store station filter in hash set
    var stationsFilter = new HashSet<int>(request.StationIds);
    // first filter by datapoints, because you have more different datapoints than stations
    foreach (var datapointId in request.DatapointIds.OrderBy(c => c)) {                    
        foreach (var item in ArchiveCacheByDatapoint[datapointId]) {                        
            if (stationsFilter.Contains(item.StationId)) {
                // both datapoint and station matches filter - found item
                numFound++;
                if (numFound >= request.Start && result.Count < request.Length) {
                    // add to result list if matches paging criteria
                    result.Add(item);
                }
            }
        }
    }
}
else if (hasDatapoints) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();                
    foreach (var datapoint in request.DatapointIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByDatapoint[datapoint];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else if (hasStations) {                
    var query = Enumerable.Empty<ArchiveCacheValue>();
    foreach (var station in request.StationIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByStation[station];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else {
    // no need to do Count()
    numFound = ArchiveCache.Count;
    // no need for Skip/Take here really, ArchiveCache is a list/array,
    // so you can use indexes, which will be faster
    result = ArchiveCache.Skip(request.Start).Take(request.Length).ToList();
}

// Number of entries on the current page that will be shown
int numShown = result.Count;

I've measured it, and on my machine it runs in 1 ms (sometimes up to 10 ms) for all types of queries I tried (by stations only, by datapoints only, by both stations and datapoints), for 100 million items.

Upvotes: 6

vasil oreshenski

Reputation: 2836

Another optimization: a LINQ join in memory is not that efficient.

So instead of joining against request.StationIds and request.DatapointIds, I would create hash sets from them and use a simple Contains over the hash set. Something like this:

if (request.StationIds.Count > 0)
{
    var stationIdSet = new HashSet<int>(request.StationIds);
    query = (from a in ArchiveCache
             where stationIdSet.Contains(a.StationId)
             select a);
               // .AsParallel()
               // .WithDegreeOfParallelism(...);
}

Running it sequentially for 9 million records and 30 station ids outperformed the join by roughly 150 - 250 ms.

For the parallel version with a degree of parallelism of 2, both the join and the hash set performed worse.

NOTE: AsParallel adds overhead, so for simple operations like these it may not be the best option.

Upvotes: 1

nvoigt

Reputation: 77304

Evk's answer has some good optimizations; I just wanted to point out some very unoptimized things in your original code. Maybe removing them would speed it up already:

You do your query in-memory three times!

// execute query first time
int numFound = query.Count();

var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// executing query second time!
int numShown = result.Count();

// executing query third time!    
// result.Select [...] .ToList()

You should really do it once and then do the counting and pagination on that result.
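One way to do that (a sketch): materialize the filtered sequence once into a list, then count and page against that list:

// The query executes exactly once here.
var matches = query.ToList();

int numFound = matches.Count;
var pageRows = matches.Skip(request.Start).Take(request.Length).ToList();
int numShown = pageRows.Count;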

Upvotes: 1
