Reputation: 19570
I'm working on an audit log which saves sessions in RavenDB. Initially, the website for querying the audit logs was responsive enough but as the amount of logged data has increased, the search page became unusable (it times out before returning using default settings - regardless of the query used). Right now we have about 45mil sessions in the table that gets queried but steady state is expected to be around 150mil documents.
The problem is that with this much live data, playing around to test things has become impractical. I hope some one can give me some ideas what would be the most productive areas to investigate.
The index looks like this:
public AuditSessions_WithSearchParameters()
{
Map = sessions => from session in sessions
select new Result
{
ApplicationName = session.ApplicationName,
SessionId = session.SessionId,
StartedUtc = session.StartedUtc,
User_Cpr = session.User.Cpr,
User_CprPersonId = session.User.CprPersonId,
User_ApplicationUserId = session.User.ApplicationUserId
};
Store(r => r.ApplicationName, FieldStorage.Yes);
Store(r => r.StartedUtc, FieldStorage.Yes);
Store(r => r.User_Cpr, FieldStorage.Yes);
Store(r => r.User_CprPersonId, FieldStorage.Yes);
Store(r => r.User_ApplicationUserId, FieldStorage.Yes);
}
The essense of the query is this bit:
// Query input paramters
var fromDateUtc = fromDate.ToUniversalTime();
var toDateUtc = toDate.ToUniversalTime();
sessionQuery = sessionQuery
.Where(s =>
s.ApplicationName == applicationName &&
s.StartedUtc >= fromDateUtc &&
s.StartedUtc <= toDateUtc
);
var totalItems = Count(sessionQuery);
var sessionData =
sessionQuery
.OrderByDescending(s => s.StartedUtc)
.Skip((page - 1) * PageSize)
.Take(PageSize)
.ProjectFromIndexFieldsInto<AuditSessions_WithSearchParameters.ResultWithAuditSession>()
.Select(s => new
{
s.SessionId,
s.SessionGroupId,
s.ApplicationName,
s.StartedUtc,
s.Type,
s.ResourceUri,
s.User,
s.ImpersonatingUser
})
.ToList();
First, to determine the number of pages of results, I count the number of results in my query using this method:
private static int Count<T>(IRavenQueryable<T> results)
{
RavenQueryStatistics stats;
results.Statistics(out stats).Take(0).ToArray();
return stats.TotalResults;
}
This turns out to be very expensive in itself, so optimizations are relevant both here and in the rest of the query.
The query time is not related to the amount of result items in any relevant way. If I use a different value for the applicationName
parameter than any of the results, it is just as slow.
One area of improvement could be to use sequential IDs for the sessions. For reasons not relevant to this post, I found it most practical to use guid based ids. I'm not sure if I can easily change IDs of the existing values (with this much data) and I would prefer not to drop the data (but might if the expected impact is large enough). I understand that sequential ids result in better behaving b-trees for the indexes, but I have no idea how significant the impact is.
Another approach could be to include a timestamp in the id and query for documents with ids starting with the string matching enough of the time to filter the result. An example id could be AuditSessions/2017-12-31-24-31-42/bc835d6c-2fba-4591-af92-7aab96339d84
. This also requires me to update or drop all the existing data. This of course also has the benefits of mostly sequential ids.
A third approach could be to move old data into a different collection over time, in recognition of the fact that you would most often look at the most recent data. This requires a background job and support for querying across collection time boundaries. It also has the issue that the collection with the old sessions is still slow if you need to access it.
I'm hoping there is something simpler than these solutions, such as modifying the query or the indexed fields in a way that avoids a lot of work.
Upvotes: 1
Views: 92
Reputation: 22956
At a glance, it is probably related to the range query on the StartedUtc
.
I'm assuming that you are using exact numbers, so you have a LOT of distinct values there.
If you can, you can dramatically reduce the cost by changing the index to index on a second / minute granularity (which is usually what you are querying on), and then use Ticks
, which allow us to use numeric range query.
StartedUtcTicks = new Datetime(session.StartedUtc.Year, session.StartedUtc.Month, session.StartedUtc.Day, session.StartedUtc.Hour, session.StartedUtc.Minute, session.StartedUtc.Second).Ticks,
And then query by the date ticks.
Upvotes: 1