Slow identification of duplicates using linq

I have two lists of usernames, where BigList contains 20000 usernames and emails, and SmallList contains 1500 usernames. The big list contains duplicate users, in the sense that they have the same email, but unique usernames. The small list has unique usernames. I need to return the shortest username of each duplicate user in list1 (as determined by email) if the user is also present in SmallList.

I have solved the problem using linq, but it takes upwards of 30 seconds, which is too slow:

return BigList.Where(u => SmallList.Contains(u.UserName))
                    .OrderBy(u => u.UserName.Length)
                    .GroupBy(u => u.EmailAddress)
                    .Select(g => g.FirstOrDefault())
                    .Select(u => u.UserName).ToList();

Can anything be done to improve the performance of this query? Thank you!

Upvotes: 0

Views: 52

Answers (2)

Harald Coppoolse
Harald Coppoolse

Reputation: 30454

Why do you order 20.000 users if you will throw away 18.500 of them? Wouldn't it be way more efficient to first select the items that you want and then order them in ascending userName length?

First I'll convert the BigList into groups of users that have the same Email. From all elements in each group I keep the shortest UserName. Apparently you are not interested in the email in your final result.

From the remaining UserNames, I keep only those UserNames that are also in SmallList.

I use the overload of Enumerable.GroupBy that has a parameter resultSelector, so I can manipulate the result.

var result = BigList.GroupBy(

   // keySelector: make groups of users with the same EmailAddress:
   user => user.EmailAddress,

   // resultSelector: from each emailAddress and all Users that have this emailAddress
   // make one new Object, the one that contains the smallest UserName
   (emailAddress, usersWithThisEmailAddress) => usersWithThisEmailAddres
       .Select(user => user.UserName)
       .OrderBy(userName => userName.Length)
       .FirstOrDefault())

// You don't want to keep all UserNames, keep only those that are also in smallList:
.Where(userName => smallList.Contains(userName));

To get the shortest UserName of the users in each group, you could Order them in ascending order of UserName length and take the first one. But why order your 2nd and 3rd, and 54th, if you will only use first one of your sorted sequence.

A method where you will only have to enumerate your sequence once, is the lesser known method Enumerable.Aggregate:

(emailAddress, usersWithThisEmailAddress) => usersWithThisEmailAddres
    .Select(user => user.UserName)
    .Aggregate( (shortestUserName, nextUserName) => 
         (nextUserName.Length < shortestUserName.Length) ? nextUserName : shortestUserName);

Aggregate does something like the following.

IEnumerable<string> userNames = ...
string shortestUserName = userNames.First();
foreach (string nextUserName in userNames.Skip(1))
{
    shortestUserName = (nextUserName.Length < shortestUserName.Length) ?
        nextUserName : shortestUserName;
}
return shortestUserName;

In fact, Aggregate is even a little bit more efficient, by using GetEnumerator and MoveNext. This requires a little bit of knowledge of how to enumerate at its lowest level, if you don't know anything about it, don't worry, you seldom need to use it, usually only if you want to improve performance:

IEnumerable<string> userNames = ...
IEnumerator<string> enumerator = userNames.GetEnumerator();
if (enumerator.MoveNext())
{
    // there is at least one user name in the sequence, it is the shortest until now
    string shortestUserName = enumerator.Current;

    // while there are more userNames, check if the next one is shorter:
    while (enumerator.MoveNext())
    {
        // There is a next user name. Is it shorter?
        shortestUserName = (enumerator.Current.Length < shortestUserName.Length) ?
        enumerator.Current: shortestUserName;
    }
}
// else: there are no elements at all, decide what to do.

And if you want to squeeze the last possible optimization out of it:

while (enumerator.MoveNext())
{
    if (enumerator.Current.Length < shortestUserName.Length)
    {
        shortestUserName = enumerator.Current;
    }
}

Upvotes: 0

Daniel A. White
Daniel A. White

Reputation: 190907

List<T>.Contains doesn't scale well as it has to iterate at most n times (the number of items in that list) thru the list for each item in the BigList. Consider using HashSet<T> instead of List<T> for SmallList

Upvotes: 2

Related Questions