Querying nested lists with LINQ instead of loops

Question

Let's say I have the following setup:

Continent
--Countries
----Provinces
------Cities

A continent contains a list of countries, each country contains a list of provinces, and each province contains a list of cities. At each level of nesting, let's say I want to apply a check (the name is longer than 5 characters).

Instead of using this loop structure:

var countries = dbSet.Countries.Where(c => c.Name.Length > 5);
foreach (var country in countries)
{
   country.Provinces = country.Provinces.Where(p => p.Name.Length > 5);
   foreach (var province in country.Provinces)
   {
      province.Cities = province.Cities.Where(ci => ci.Name.Length > 5);
   }
}

How could I accomplish the same efficiently with LINQ?

madreflection · Accepted Answer

Efficiently? In terms of written code, sure, but we'll call that "cleanly". In terms of execution, that's not a question you should be asking at this point. Focus on getting the job done in code that's understandable and then "race your horses" to see if you really need to improve on it.

One thing I should caution is that LINQ is about querying, which doesn't mutate the source sequences. You're assigning the filtered sequences back to the properties and that's contrary to LINQ principles. The tag shows you're using Entity Framework so it's definitely not a good idea to do that because it uses its own collection types under the hood.

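To answer your question, the SelectMany extension method does the looping for you: for each element of the source, it iterates the child sequence you project and yields those children as one flat sequence. When the query is translated to a database query, that becomes a join.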

dbSet.Countries
    .Where(c => c.Name.Length > 5)
    .SelectMany(c => c.Provinces)
    .Where(p => p.Name.Length > 5)
    .SelectMany(p => p.Cities)
    .Where(ci => ci.Name.Length > 5)
    .Select(ci => ci.Name);

That'll give you the names of all cities where the country, province, and city names are all longer than 5 characters.

But that only gives you the names of the cities. If you want information from every level, extension methods are difficult to use because you have to project the "transparent identifiers" yourself at each step along the way, and it can get pretty cluttered. Let the compiler do that for you by using LINQ query syntax.

from c in dbSet.Countries
where c.Name.Length > 5
from p in c.Provinces
where p.Name.Length > 5
from ci in p.Cities
where ci.Name.Length > 5

That will do the same thing as above, except now, all your range variables are carried through the expression so you can do this:

select new
{
    CountryName = c.Name,
    ProvinceName = p.Name,
    CityName = ci.Name
};

...or whatever you want to do with c, p, and ci.
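Here's a rough, self-contained sketch of that query running against in-memory lists (LINQ to Objects) rather than Entity Framework. The City, Province, and Country records, the QuerySyntaxDemo class, and the sample data are all made up for illustration; they are not your entity classes, and records assume C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the entity classes (not from the question).
public record City(string Name);
public record Province(string Name, List<City> Cities);
public record Country(string Name, List<Province> Provinces);

public static class QuerySyntaxDemo
{
    // Made-up sample data; only names longer than 5 characters pass the filters.
    public static List<Country> SampleCountries() => new List<Country>
    {
        new Country("Canada", new List<Province>
        {
            new Province("Ontario", new List<City> { new City("Toronto"), new City("Ajax") }),
            new Province("Yukon", new List<City> { new City("Whitehorse") }),
        }),
        new Country("Chile", new List<Province>
        {
            new Province("Santiago", new List<City> { new City("Santiago") }),
        }),
    };

    public static List<string> LongNames(IEnumerable<Country> countries)
    {
        var query =
            from c in countries
            where c.Name.Length > 5
            from p in c.Provinces
            where p.Name.Length > 5
            from ci in p.Cities
            where ci.Name.Length > 5
            // c, p, and ci are all still in scope in the select clause.
            select $"{c.Name} / {p.Name} / {ci.Name}";

        return query.ToList();
    }

    public static void Main()
    {
        foreach (var line in LongNames(SampleCountries()))
            Console.WriteLine(line); // prints "Canada / Ontario / Toronto"
    }
}
```

"Chile" and "Yukon" are exactly 5 characters and "Ajax" is 4, so only the Canada / Ontario / Toronto combination survives all three filters.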

EDIT: Merged the second answer, which addressed questions in the comments, into this one.

In order to preserve the parent levels through the query, you need to project a container for the parent and the child each time you loop through a collection of child objects. When you use LINQ syntax, the compiler does this for you in the form of a "transparent identifier". It's transparent because your references to range variables "go right through" it and you never see it. Jon Skeet touches on them near the end of Reimplementing LINQ to Objects: Part 19 – Join.

To accomplish this, you want to use a different overload of SelectMany this time, one that also takes a lambda to project the container you need. On each iteration through the child items, that lambda is called with two arguments: the parent and the current child item.

var result = dbSet.Countries
    .Where(c => c.Name.Length > 5)
    .SelectMany(c => c.Provinces, (c, p) => new { c, p })
    .Where(x1 => x1.p.Name.Length > 5)
    .SelectMany(x1 => x1.p.Cities, (x1, ci) => new { x1.c, x1.p, ci })
    .Where(x2 => x2.ci.Name.Length > 5)
    .Select(x2 => new
    {
        Country = x2.c,
        Province = x2.p,
        City = x2.ci
    })
    .ToList();

The x1 and x2 lambda arguments are the containers projected from the preceding SelectMany call. I like to call them "opaque identifiers". They're no longer transparent if you have to refer to them explicitly.

The c, p, and ci range variables are now properties of those containers.

As a bonus note, when you use a let clause, the compiler's doing the exact same thing, creating a container that has all of the available range variables and the new variable that's being introduced.
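To make the let point concrete, here's a small sketch that runs the same filter twice over an in-memory array: once with a let clause, and once with a hand-written Select that approximates what the compiler produces. The LetDemo class and the sample names are invented for illustration, and the real compiler-generated container uses an unspeakable name rather than x.

```csharp
using System;
using System.Linq;

public static class LetDemo
{
    // Made-up sample data.
    static readonly string[] Names = { "Toronto", "Ajax", "Vancouver" };

    // Query syntax: let introduces a new range variable alongside name.
    public static string[] WithLet() =>
        (from name in Names
         let length = name.Length
         where length > 5
         select $"{name}:{length}").ToArray();

    // Roughly the equivalent by hand: a Select wraps the existing range
    // variable and the new one in an anonymous-type container.
    public static string[] WithoutLet() =>
        Names
            .Select(name => new { name, length = name.Length })
            .Where(x => x.length > 5)
            .Select(x => $"{x.name}:{x.length}")
            .ToArray();

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", WithLet()));    // Toronto:7, Vancouver:9
        Console.WriteLine(string.Join(", ", WithoutLet())); // Toronto:7, Vancouver:9
    }
}
```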

I want to end this with a word of advice: Use LINQ syntax as much as possible. It's easier to write and get right, and it's easier to read because you don't have all those projections that the compiler can do for you. If you have to resort to extension methods, do so in parts. The two techniques can be mixed. There's art in keeping it from looking like a mess.
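As one possible way to mix the two "in parts", the sketch below uses query syntax for the nested traversal and then switches to extension methods for Take and ToList, which have no C# query keyword. The Province record, the MixedDemo class, the method name, and the data are all hypothetical, and it runs on in-memory lists rather than Entity Framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical simplified model: a province with a list of city names.
public record Province(string Name, List<string> Cities);

public static class MixedDemo
{
    public static List<string> FirstLongCityNames(IEnumerable<Province> provinces, int count)
    {
        // Part 1: query syntax handles the nested traversal and filtering.
        var longNames =
            from p in provinces
            where p.Name.Length > 5
            from city in p.Cities
            where city.Length > 5
            select city;

        // Part 2: extension methods for operators without a query keyword.
        return longNames
            .Take(count)
            .ToList();
    }

    public static void Main()
    {
        // Made-up sample data.
        var provinces = new List<Province>
        {
            new Province("Ontario", new List<string> { "Toronto", "Ajax", "Hamilton" }),
            new Province("Alberta", new List<string> { "Calgary", "Edmonton" }),
            new Province("Yukon", new List<string> { "Whitehorse" }),
        };

        // prints "Toronto, Hamilton, Calgary"
        Console.WriteLine(string.Join(", ", FirstLongCityNames(provinces, 3)));
    }
}
```

Naming the intermediate query (longNames here) keeps each part readable on its own instead of wrapping a query expression in parentheses and chaining onto it.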
