Chris
Chris

Reputation: 7611

Finding similar records in LINQ

I have the following LINQ query, which will be used to find any consignments that are 'similar':

from c in cons
group c by new { c.TripDate.Value, c.DeliveryPostcode, c.DeliveryName } into cg
let min = cg.Min(a => a.DeliverFrom)
let max = cg.Max(a => a.DeliverFrom)
let span = max - min
where span.TotalMinutes <= 59
select cg;

The main thing is the min, max and span. Basically, any consignments that are in the 'group', that have a DeliverFrom datetime within 59 minutes of any other one in the group, will be returned in the group.

The code above looked good originally to me, but upon further inspection it seems that if there's more than 2 records in the group - 2 with DeliverFrom dates 59 minutes of each other, and one with a DeliverFrom date not within 59 minutes of any, then the query would not return that group, as it'll be selecting the min and the max and seeing that the difference is more than 59 minutes. What I want to happen is to see that there are 2 consignments in the group with DeliverFrom dates close enough, and just select a group containing them two.

How would I go about doing this?

EDIT: Doh, another clause has been added in this. There's a field called 'Weight' and one called 'Spaces', each group can have a max of 26 Weight and 26 Spaces

Upvotes: 3

Views: 191

Answers (2)

Jon
Jon

Reputation: 437404

I 'd group the consignments with the same logic like you do but use this overload of GroupBy instead, allowing me to project each group of consigments into another type. This type would here be an enumerable sequence of groups of consigments, each element in which represents consignments that non only were in the same group to begin with, but also should all be delivered within the duration of an hour. So the signature of resultSelector would be

Func<anontype, IEnumerable<Consignment>, IEnumerable<IEnumerable<Consignment>>>

At this point it becomes clear that it would probably be a good idea to define a type for the grouping so that you can get rid of the anonymous type in the above signature; otherwise you 'd be forced to define your resultSelector as a lambda.

Within resultSelector, you need to first of all sort the incoming group of consignments by DeliverFrom and then return sub-groups based on that time. So it might look like this:

IEnumerable<IEnumerable<Consignment>>
Partitioner(ConsignmentGroupKey key, IEnumerable<Consignment> cg)
{
    cg = cg.OrderBy(c => c.DeliverFrom);
    var startTime = cg.First().DeliverFrom;
    var subgroup = new List<Consignment>();

    foreach(var cons in cg) {
        if ((cons.DeliverFrom - startTime).TotalMinutes < 60) {
            subgroup.Add(cons);
        }
        else {
            yield return subgroup;
            startTime = cons.DeliverFrom;
            subgroup = new List<Consignment>() { cons };
        }
    }

    if (subgroup.Count > 0) {
        yield return subgroup;
    }
}

I haven't tried this, but as far as I can tell it should work.

Upvotes: 1

Chris Shain
Chris Shain

Reputation: 51339

If I'm not mistaken, what you are looking for is a statistical problem called cluster identification, and if so it is a far more complex problem than you might think.

As a thought exercise, imagine if you had 3 entries, at 1:00, 1:30, and 2:00. How would you want to group these? Either the first two or the last two would work as a group (less than 59 minutes apart), but all 3 would not.

If you just want to keep chaining items together into a group as long as they are within 59 minutes of any other item in the group, you'd need to keep iterating until you stop finding new items to add to any cluster.

Upvotes: 3

Related Questions