curiousBoy
curiousBoy

Reputation: 6834

EF Core Single vs. Split Queries

I am using EF Core 7. It looks like, since EF Core 5, there is now Single vs Split Query execution.

I see that the default configuration still uses the Single Query execution though.

I noticed in my logs it was saying:

Microsoft.EntityFrameworkCore.Query.MultipleCollectionIncludeWarning': Compiling a query which loads related collections for more than one collection navigation, either via 'Include' or through projection, but no 'QuerySplittingBehavior' has been configured. By default, Entity Framework will use 'QuerySplittingBehavior.SingleQuery', which can potentially result in slow query performance.

Then I configured a warning on db context to get more details:

services.AddDbContextPool<TheBestDbContext>(
    options => options.UseSqlServer(configuration.GetConnectionString("TheBestDbConnection"))
        .ConfigureWarnings(warnings => warnings.Throw(RelationalEventId.MultipleCollectionIncludeWarning))
);

Then I was able to specifically see which call was actually causing that warning.

var user = await _userManager.Users
    .Include(x => x.UserRoles)
    .ThenInclude(x => x.ApplicationRole)
    .ThenInclude(x => x.RoleClaims)
    .SingleOrDefaultAsync(u => u.Id == userId);

So basically same code would be like:

    var user = await _userManager.Users
         .Include(x => x.UserRoles)
         .ThenInclude(x => x.ApplicationRole)
         .ThenInclude(x => x.RoleClaims)
         .AsSplitQuery() // <===
         .SingleOrDefaultAsync(u => u.Id == userId);

with Split query option.

I went through the documentation, but I'm still not sure how to create a pattern out of it.

I would like to set the most common one as a default value across the project, and only use the other for specific scenarios.

Based on the documentation, I have a feeling that the "Split" should be used as default in general but with caution. I also noticed on their documentation specific to pagination, that it says:

When using split queries with Skip/Take, pay special attention to making your query ordering fully unique; not doing so could cause incorrect data to be returned. For example, if results are ordered only by date, but there can be multiple results with the same date, then each one of the split queries could each get different results from the database. Ordering by both date and ID (or any other unique property or combination of properties) makes the ordering fully unique and avoids this problem. Note that relational databases do not apply any ordering by default, even on the primary key.

which completely makes sense as the query will be split.

But if we are mainly fetching from database for a single record, regardless how big or small the include list with its navigation properties, should I always go with "Split" approach?

I would love to hear if there are any best practices on that and when to use which approach.

Upvotes: 22

Views: 17162

Answers (3)

Steve Py
Steve Py

Reputation: 34908

The documentation at https://learn.microsoft.com/en-us/ef/core/querying/single-split-queries outlines the considerations when Split Queries could have unintentional consequences, particularly around isolation and ordering. As mentioned when loading a single record with related details, a single query execution is generally perferred. The warning is appearing because you have a one-to-many, which contains a one-to-many, so it is warning that this can potentially lead to a much larger Cartesian Product in terms of a JOIN-based query. To avoid the warning as you are confident that the query is reasonable in size, you can specify .AsSingleQuery() explicitly and the warning should disappear.

When working with object graphs like this you can consider designing operations against the data state to be as atomic as possible. IF you are editing a User that has Roles & Claims, rather than loading everything for a User and attempting to edit the entire graph in memory in one go, you might structure the application to perform actions like "AddRoleToUser", "RemoveRoleFromUser", AddClaimToUserRole", etc. So instead of loading User /w Roles /w Claims, these actions just load Roles for a user, or Claims for a UserRole respectively to alter this data.

Upvotes: 4

curiousBoy
curiousBoy

Reputation: 6834

After searching through this to figure out if there is any pattern to apply this, and with all the great content provided at the bottom, I was still not sure as I was looking for "When to use split queries" and "when not to", so I tried the summarized my understanding at the bottom.

I will use the same example that Microsoft shows on Single vs Split Queries

var blogs = ctx.Blogs
    .Include(b => b.Posts)
    .Include(b => b.Contributors)
    .ToList();

and here is the generated SQL for that:

SELECT [b].[Id], [b].[Name], [p].[Id], [p].[BlogId], [p].[Title], [c].[Id], [c].[BlogId], [c].[FirstName], [c].[LastName]
FROM [Blogs] AS [b]
LEFT JOIN [Posts] AS [p] ON [b].[Id] = [p].[BlogId]
LEFT JOIN [Contributors] AS [c] ON [b].[Id] = [c].[BlogId]
ORDER BY [b].[Id], [p].[Id]

Microsoft says:

In this example, since both Posts and Contributors are collection navigations of Blog - they're at the same level - relational databases return a cross product: each row from Posts is joined with each row from Contributors. This means that if a given blog has 10 posts and 10 contributors, the database returns 100 rows for that single blog. This phenomenon - sometimes called cartesian explosion - can cause huge amounts of data to unintentionally get transferred to the client, especially as more sibling JOINs are added to the query; this can be a major performance issue in database applications.

However what it doesn't clearly mention is, other than sorting/ordering issues, this may easily mess up the performance of the queries.

First concern is, we are going to be hitting to database multiple times in that case.

Let's check this one:

using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog => blog.Posts)
        .AsSplitQuery()
        .ToList();
}

And check out the generated SQL when .AsSplitQuery() is used.

SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]

SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Posts] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]

So above query was kind of surprised me. It is interesting that when it uses the split option, it still joins on the second query even though second query should only be pulling data from posts table. Pretty sure EF Core folks had some idea behind that but it just doesn't make sense to me. Then what is the point of having that foreign key over there?

Looks like Microsoft was mainly focused on a solution to avoid cartesian explosion problem but obviously it doesn't mean that "split queries" should be used as best practices by default going forward. Definitely not!

And another possible problem I can think of is data inconsistency, yet the queries are ran separate, you can't guarantee the data consistency. (unless completely locked)

I just don't want to throw away the feature of course. There are still some "good" scenarios to use Split Queries imo, (unless you are really worried about the data consistency) like if we are returning lots of columns with a relation and the size is pretty large, then this could be really performance factor. Or the parent data is not a lot, but tons of navigation sets, then there is your cartesian explosion.

PS: Note that cartesian explosion does not occur when the two JOINs aren't at the same level.

Last but not least, personally, if I am really going to be pulling some heavy amount of data with bunch of relation of relation of relation, I would still prefer those "good old" Stored Procedures. It never gets old!

Upvotes: 6

Svyatoslav Danyliv
Svyatoslav Danyliv

Reputation: 27376

But if we are mainly fetching from database for a single record, regardless how big or small the include list with its navigation properties, should I always go with "Split" approach?

It depends, let's examine your example in Single query approach:

var user = await _userManager.Users              // 1 records based on SingleOrDefault but to server goes TAKE 2
    .Include(x => x.UserRoles)                   // R roles
    .ThenInclude(x => x.ApplicationRole)         // 1 record 
    .ThenInclude(x => x.RoleClaims)              // C claims
    .SingleOrDefaultAsync(u => u.Id == userId);

As result on the client will be returned RecordCount = 1 * R * 1 * C records. Then they will be deduplicated and placed in appropriate collections. If RecordCount is approximately small Single query can be best approach.

Also EF Core adds ORDER BY for such query which may slowdown execution. So better examine execution plan.

Side note: Better to use FirstOrDefault/Async it CAN be a lot faster than SingleOrDefault/Async, when SQL server fails to detect that there no 2 records in recordset early.

Upvotes: 3

Related Questions