Reputation: 11788
We have an ASP.NET MVC web application that is hooked up to a SQL Server DB via Entity Framework. One of the main tasks of this application is to allow users to quickly search and filter a huge database table that holds archive values.
The table structure is quite simple: Timestamp (DateTime), StationId (int), DatapointId (int), Value (double). This table holds somewhere between 10 and 100 million rows. I optimized the DB table with a covering index etc., but the user experience was still quite laggy when filtering by DatapointId, StationId and time, and then skipping and taking only the portion I want to show on the page.
So I tried a different approach: as our server has a lot of RAM, I thought we could simply load the whole archive table into a List<ArchiveRow> when the web app starts up, and then get the data directly from this List instead of doing a round-trip to the database. This works quite well; it takes about 9 seconds to load the whole archive table (currently about 10 million entries) into the List. The response wrapper and the ArchiveRow are simple classes that look like this:
public class ArchiveResponse {
    public int Length { get; set; }
    public int numShown { get; set; }
    public int numFound { get; set; }
    public int numTotal { get; set; }
    public List<ArchiveRow> Rows { get; set; }
}
and accordingly:
public class ArchiveRow {
    public int s { get; set; }
    public int d { get; set; }
    public DateTime t { get; set; }
    public double v { get; set; }
}
When I now try to get the desired data from the List with a LINQ query, it is already faster than querying the DB, but it is still quite slow when filtering by multiple criteria. For example, when I filter by one StationId and 12 DatapointIds, it takes about 5 seconds to retrieve a window of 25 rows. I already switched from filtering with Where to using joins, but I think there is still room for improvement. Are there maybe better ways to implement such a caching mechanism while keeping memory consumption as low as possible? Are there other collection types that are better suited for this purpose?
So here is the code that filters and fetches the relevant data from the ArchiveCache list:
// Total number of entries in archive cache
var numTotal = ArchiveCache.Count();

// Initial Linq query
ParallelQuery<ArchiveCacheValue> query = ArchiveCache.AsParallel();

// The request may contain StationIds that the user is interested in,
// so here's the filtering by StationIds with a join:
if (request.StationIds.Count > 0)
{
    query = from a in ArchiveCache.AsParallel()
            join b in request.StationIds.AsParallel()
            on a.StationId equals b
            select a;
}

// The request may contain DatapointIds that the user is interested in,
// so here's the filtering by DatapointIds with a join:
if (request.DatapointIds.Count > 0)
{
    query = from a in query.AsParallel()
            join b in request.DatapointIds.AsParallel()
            on a.DataPointId equals b
            select a;
}

// Number of matching entries after filtering and before windowing
int numFound = query.Count();

// Pagination: Select only the current window that needs to be shown on the page
var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);

// Number of entries on the current page that will be shown
int numShown = result.Count();

// Build a response object, serialize it to Json and return to client
// Note: The projection with the Rows is not a bottleneck, it is only done to
// shorten 'StationId' to 's' etc. At this point there are only 25 to 50 rows,
// so that is no problem and happens in way less than 1 ms
ArchiveResponse myResponse = new ArchiveResponse();
myResponse.Length = request.Length;
myResponse.numShown = numShown;
myResponse.numFound = numFound;
myResponse.numTotal = numTotal;
myResponse.Rows = result.Select(x => new ArchiveRow() { s = x.StationId, d = x.DataPointId, t = x.DateValue, v = x.Value }).ToList();

return JsonSerializer.ToJsonString(myResponse);
Some more details: the number of stations is usually somewhere between 5 and 50, rarely more than 50. The number of datapoints is < 7000. The web application is set to 64 bit, with <gcAllowVeryLargeObjects enabled="true" /> set in the web.config.
I'm really looking forward to further improvements and recommendations. Maybe there's a completely different approach based on arrays or similar that performs way better without LINQ?
Upvotes: 7
Views: 7193
Reputation: 11406
I would at least store those values in an array or vector of struct ArchiveRow to make sure all the data is in contiguous memory. This way you will greatly benefit from locality of reference (i.e. use the L1 cache efficiently), and you also avoid the overhead of a list (pointers/references). (Update: I did a quick lookup of C# List. It looks like List is the same as a C++ vector (i.e. an array), and C# LinkedList is the same as a C++ list; a bit confusing. In other words, no pointer overhead here for the C# List<>.)
Try to make the struct as small as possible. Use uint32 or even uint16 where possible, maybe 32 bits for the datetime, and maybe even float instead of double (depending on your values). Also put the widest values first in your struct (for better alignment).
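A minimal sketch of such a compacted row (the field names, widths and the epoch-based timestamp are assumptions, adjust them to your actual value ranges):

public struct ArchiveRowCompact
{
    // Widest fields first for alignment; roughly 12 bytes per entry instead of the
    // 24 bytes of fields (plus object and reference overhead) of the original class.
    public uint TimestampSeconds;  // seconds since a fixed epoch instead of a full DateTime
    public float Value;            // float instead of double, if the precision loss is acceptable
    public ushort StationId;       // far fewer than 65536 stations
    public ushort DatapointId;     // < 7000 datapoints, fits in 16 bits
}

Stored in a plain ArchiveRowCompact[] of 10 million entries, that is roughly 120 MB in one contiguous block.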
Even the brute force approach (scanning the whole array, a few 100MB of contiguous memory) should finish in well under 1 second.
To optimize further you can sort the data (or better, retrieve it sorted from the database), which will allow you to do a binary search and also keeps the values for the result set close together. For example: sort on StationId, then DataPointId.
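For the lookup itself, a lower-bound style binary search over the sorted array could look roughly like this (a sketch, assuming the compact struct above and a sort by StationId, then DatapointId):

static int LowerBound(ArchiveRowCompact[] rows, ushort stationId, ushort datapointId)
{
    // Find the first index whose (StationId, DatapointId) is >= the requested key.
    int lo = 0, hi = rows.Length;
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        var r = rows[mid];
        bool less = r.StationId < stationId ||
                    (r.StationId == stationId && r.DatapointId < datapointId);
        if (less) lo = mid + 1; else hi = mid;
    }
    return lo; // scan forward from here while StationId/DatapointId still match
}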
If the data gets bigger, you could store the same structure on disk (in a binary file) and access it via memory mapping.
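A rough sketch of that idea, assuming the compact rows were written as fixed-size records to a file called archive.bin (file name and layout are assumptions):

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

// Scan fixed-size records directly from a memory-mapped binary file.
int rowSize = Marshal.SizeOf<ArchiveRowCompact>();
long rowCount = new FileInfo("archive.bin").Length / rowSize;

using var mmf = MemoryMappedFile.CreateFromFile("archive.bin", FileMode.Open);
using var accessor = mmf.CreateViewAccessor();

for (long i = 0; i < rowCount; i++)
{
    accessor.Read(i * rowSize, out ArchiveRowCompact row);
    // ... filter 'row' here, exactly like the in-memory scan
}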
Also, you only have to scan for the first 25 items. You could then store the last checked position (for that session/query), and when the next page is requested, start from there for the next 25 items. This also saves the memory that would otherwise be needed to store the full result set.
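That could look roughly like this (a sketch; the cursor index would live in session state or a server-side cache keyed by the query, and the matches delegate stands in for whatever filter the request implies):

using System;
using System.Collections.Generic;

// Collect the next page starting from where the previous scan stopped,
// and return the position to resume from for the following page.
static (List<ArchiveRowCompact> Page, int NextIndex) NextPage(
    ArchiveRowCompact[] archive, int startIndex, int pageSize,
    Func<ArchiveRowCompact, bool> matches)
{
    var page = new List<ArchiveRowCompact>(pageSize);
    int i = startIndex;
    for (; i < archive.Length && page.Count < pageSize; i++)
    {
        if (matches(archive[i]))
            page.Add(archive[i]);
    }
    return (page, i); // store NextIndex for the next page request
}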
If the number of stations is small and the data is sorted on StationId, you could also keep a small list or jump table (built while importing) with the starting position of each StationId, and jump directly there to scan for the datapoints of that station.
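Such a jump table is cheap to build once while loading (a sketch, assuming the array is already sorted by StationId):

using System.Collections.Generic;

// Map each StationId to the first index of its block in the sorted array.
static Dictionary<ushort, int> BuildStationIndex(ArchiveRowCompact[] archive)
{
    var stationStart = new Dictionary<ushort, int>();
    for (int i = 0; i < archive.Length; i++)
    {
        if (!stationStart.ContainsKey(archive[i].StationId))
            stationStart[archive[i].StationId] = i; // first row of this station
    }
    return stationStart;
}

// Later: start scanning at stationStart[stationId] instead of at index 0.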
it takes about 9 seconds to load the whole archive table (with currently about 10 million entries) into the List.
Just in case you didn't do this already, make sure to set an initial Capacity on the list to avoid multiple reallocations.
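For example (the capacity value is an assumption; a cheap SELECT COUNT(*) before loading, or a known upper bound, would give the right number):

using System.Collections.Generic;

// Presize the list so the backing array is allocated once instead of being
// grown and copied several times while ~10 million rows are added.
var archiveCache = new List<ArchiveRow>(capacity: 12_000_000);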
Upvotes: 3
Reputation: 101493
You can adjust your storage to this specific query type. First, create dictionaries from your in-memory archive:
ArchiveCacheByDatapoint = ArchiveCache.GroupBy(c => c.DataPointId)
.ToDictionary(c => c.Key, c => c.ToList());
ArchiveCacheByStation = ArchiveCache.GroupBy(c => c.StationId)
.ToDictionary(c => c.Key, c => c.ToList());
Then use those dictionaries in your query:
bool hasStations = request.StationIds.Length > 0;
bool hasDatapoints = request.DatapointIds.Length > 0;
int numFound = 0;
List<ArchiveCacheValue> result;

if (hasDatapoints && hasStations) {
    // special case - filter by both
    result = new List<ArchiveCacheValue>();
    // store station filter in hash set
    var stationsFilter = new HashSet<int>(request.StationIds);
    // first filter by datapoints, because you have more different datapoints than stations
    foreach (var datapointId in request.DatapointIds.OrderBy(c => c)) {
        foreach (var item in ArchiveCacheByDatapoint[datapointId]) {
            if (stationsFilter.Contains(item.StationId)) {
                // both datapoint and station matches filter - found item
                numFound++;
                if (numFound >= request.Start && result.Count < request.Length) {
                    // add to result list if matches paging criteria
                    result.Add(item);
                }
            }
        }
    }
}
else if (hasDatapoints) {
    var query = Enumerable.Empty<ArchiveCacheValue>();
    foreach (var datapoint in request.DatapointIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByDatapoint[datapoint];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else if (hasStations) {
    var query = Enumerable.Empty<ArchiveCacheValue>();
    foreach (var station in request.StationIds.OrderBy(c => c))
    {
        var list = ArchiveCacheByStation[station];
        numFound += list.Count;
        query = query.Concat(list);
    }
    // execute query just once
    result = query.Skip(request.Start).Take(request.Length).ToList();
}
else {
    // no need to do Count()
    numFound = ArchiveCache.Count;
    // no need to Skip\Take here really, ArchiveCache is list\array
    // so you can use indexes which will be faster
    result = ArchiveCache.Skip(request.Start).Take(request.Length).ToList();
}

// Number of entries on the current page that will be shown
int numShown = result.Count;
I've measured it, and on my machine it runs in 1 ms (sometimes up to 10 ms) for all the query types I tried (by stations only, by datapoints only, by both stations and datapoints), for 100 million items.
Upvotes: 6
Reputation: 2836
Another optimization: using a LINQ join in memory is not that efficient. So instead of joining against request.StationIds and request.DatapointIds, I would create HashSets from them and use a simple Contains over the HashSet. Something like this:
if (request.StationIds.Count > 0)
{
    var stationIdSet = new HashSet<int>(request.StationIds);

    query = (from a in ArchiveCache
             where stationIdSet.Contains(a.StationId)
             select a);
             // .AsParallel()
             // .WithDegreeOfParallelism(...);
}
Running it sequentially for 9 million records and 30 station ids outperformed the join by roughly 150 - 250 ms.
For the parallel version with a degree of parallelism of 2, both the join and the HashSet version performed worse.
Note: AsParallel adds overhead, so for simple operations like these it may not be the best option.
Upvotes: 1
Reputation: 77304
Evk's answer has some good optimizations; I just wanted to point out some very unoptimized things in your original code, maybe removing them would already speed it up:
You do your query in-memory three times!
// execute query first time
int numFound = query.Count();
var result = query.Skip(request.Start == 0 ? 0 : request.Start - 1).Take(request.Length);
// executing query second time!
int numShown = result.Count();
// executing query third time!
// result.Select [...] .ToList()
You should really do it once and then do the counting and pagination on that result.
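A sketch of what that could look like if you keep the rest of your original code: materialize the filtered rows once, then count and page over the resulting list.

// Materialize the filtered set exactly once; everything below works on a plain list.
var filtered = query.ToList();

int numFound = filtered.Count;

var result = filtered
    .Skip(request.Start == 0 ? 0 : request.Start - 1)
    .Take(request.Length)
    .ToList();

// Counting the page is now just a property access, not another query execution.
int numShown = result.Count;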
Upvotes: 1