Joan Venge
Joan Venge

Reputation: 331032

How to match 2 lists most effectively (fast)?

I have 2 lists<string> of items, source and target. The items in the source list will have 0 to n matches in the target list, but there will not be duplicate matches.

Considering both lists are sorted, how would you do the matching most effectively in terms of performance.

Example:

source = {"1", "2", "A", "B", ...}
target = {"1 - new music", "1 / classic", "1 | pop", "2 edit", "2 no edit", "A - sing", "B (listen)", ...}

Basically the match is simple prefix match, but say you have a method called MatchName. You can use a new function if you are gonna do more optimized searches. NameMatch just compares 2 strings and returns a bool.

In the end source[0] will have source[0].Matches to contain target[0, 1 and 2] in this case.

Upvotes: 4

Views: 2770

Answers (7)

Mike Dunlavey
Mike Dunlavey

Reputation: 40669

Since they are sorted, isn't it just a basic O(N) merge loop?

ia = ib = 0;
while(ia < na && ib < nb){
  if (A[ia] < B[ib]){
    // A[ia] is unmatched
    ia++;
  }
  else if (B[ib] < A[ia]){
    // B[ib] is unmatched
    ib++;
  }
  else {
    // A[ia] matches B[ib]
    ia++;
    ib++;
  }
}
while(ia < na){
  // A[ia] is unmatched
  ia++;
}
while(ib < nb){
  // B[ib] is unmatched
  ib++;
}

Upvotes: 1

Thorarin
Thorarin

Reputation: 48476

I'm not sure this is worth attempting to optimize very far. You could implement some kind of binary search with this, but it's effectiveness would be rather limited. How many elements are we talking about?

Without unmatched elements in target

Assuming the lists are sorted, and no elements in target exist that cannot be matched against source:

static List<string>[] FindMatches(string[] source, string[] target)
{
    // Initialize array to hold results
    List<string>[] matches = new List<string>[source.Length];
    for (int i = 0; i < matches.Length; i++)
        matches[i] = new List<string>();

    int s = 0;
    for (int t = 0; t < target.Length; t++)
    {
        while (!MatchName(source[s], target[t]))
        {
            s++;
            if (s >= source.Length)
                return matches;
        }

        matches[s].Add(target[t]);
    }

    return matches;
}

With unmatched elements

If there is a possibility of elements existing in target that have no match in source, the above will break (if the elements are not at the end of target). To fix that, it would be best to go with a different implementation for the comparison. Instead of a boolean, we need it to return 'less than', 'equal' or 'greater than', like a Comparer when used in sorting:

static List<string>[] FindMatches(string[] source, string[] target)
{
    // Initialize array to hold results
    List<string>[] matches = new List<string>[source.Length];
    for (int i = 0; i < matches.Length; i++)
        matches[i] = new List<string>();

    int s = 0;
    for (int t = 0; t < target.Length; t++)
    {
        int m = CompareName(source[s], target[t]);
        if (m == 0)
        {
            matches[s].Add(target[t]);
        }
        else if (m > 0)
        {
            s++;
            if (s >= source.Length)
                return matches;
            t--;
        }
    }

    return matches;
}

static int CompareName(string source, string target)
{
    // Whatever comparison you need here, this one is really basic :)
    return target[0] - source[0];
}

The two are otherwise essentially the same. As you can see, you loop through the target elements once, advancing the index to the source array when you no longer find a match.

If the number of source elements is limited, it might be worth doing a little smarter search. If the number of source elements is also large, the supposed benefit of that would decrease.

Then again, the first algorithm takes 0.18 seconds with 1 million target elements on my machine, in Debug mode. The second one is even faster (0.03 seconds), but that because of the simpler comparison that is being done. Possibly you would have to compare everything up to the first whitespace character, making it significantly slower.

Upvotes: 3

Dykam
Dykam

Reputation: 10290

Edited, rewritten, nontested, should have O(source + target) performance. Usage could be MatchMaker.Match(source, target).ToList();

public static class MatchMaker
{
    public class Source
    {
        char Term { get; set; }
        IEnumerable<string> Results { get; set; }
    }

    public static IEnumerable<Source> Match(IEnumerable<string> source, IEnumerable<string> target)
    {
        int currentIndex = 0;
        var matches = from term in source
                      select new Source
                      {
                          Term = term[0],
                          Result = from result in target.FromIndex(currentIndex)
                                       .TakeWhile((r, i) => {
                                           currentIndex = i;
                                           return r[0] == term[0];
                                       })
                                   select result
                      };
    }
    public static IEnumerable<T> FromIndex<T>(this IList<T> subject, int index)
    {
        while (index < subject.Count) {
            yield return subject[index++];
        }
    }
}

A simple LinQ, probably not the uttermost fast, but the clearest:

var matches = from result in target
              from term in source
              where result[0] == term[0]
              select new {
              Term: term,
              Result: result
              };

I'm against premature optimalization.

Upvotes: 1

Andrew Barrett
Andrew Barrett

Reputation: 19911

Here is an answer that only loops through both lists once, using the logic that both are sorted as an optimisation. Like most have said, I wouldn't worry too much about optimisations as it's likely to be fast enough with any of these answers, I would go for the most readable and maintainable solution.

That being said, I need something to do with my cup of coffee so here you go. One of the advantages of the below is that it allows things in the target list that have no matches in the source list, although I'm unsure if you'd require that functionality.

class Program
{
    public class Source
    {
        private readonly string key;
        public string Key { get { return key;}}

        private readonly List<string> matches = new List<string>();
        public List<string> Matches { get { return matches;} }

        public Source(string key)
        {
            this.key = key;
        }
    }

    static void Main(string[] args)
    {
        var sources = new List<Source> {new Source("A"), new Source("C"), new Source("D")};
        var targets = new List<string> { "A1", "A2", "B1", "C1", "C2", "C3", "D1", "D2", "D3", "E1" };

        var ixSource = 0;
        var currentSource = sources[ixSource++];

        foreach (var target in targets)
        {
            var compare = CompareSourceAndTarget(currentSource, target);

            if (compare > 0)
                continue;

            // Try and increment the source till we have one that matches 
            if (compare < 0)
            {
                while ((ixSource < sources.Count) && (compare < 0))
                {
                    currentSource = sources[ixSource++];
                    compare = CompareSourceAndTarget(currentSource, target);
                }
            }

            if (compare == 0)
            {
                currentSource.Matches.Add(target);
            }

            // no more sources to match against
            if ((ixSource > sources.Count))
                break;
        }

        foreach (var source in sources)
        {
            Console.WriteLine("source {0} had matches {1}", source.Key, String.Join(" ", source.Matches.ToArray()));
        }
    }

    private static int CompareSourceAndTarget(Source source, string target)
    {
        return String.Compare(source.Key, target.Substring(0, source.Key.Length), StringComparison.OrdinalIgnoreCase);
    }
}

Upvotes: 1

Guffa
Guffa

Reputation: 700342

As the items are sorted, you can just loop though the lists:

string[] source = {"1", "2", "A", "B" };
string[] target = { "1 - new music", "1 / classic", "1 | pop", "2 edit", "2 no edit", "A - sing", "B (listen)" };

List<string>[] matches = new List<string>[source.Length];
int targetIdx = 0;
for (int sourceIdx = 0; sourceIdx < source.Length; sourceIdx++) {
   matches[sourceIdx] = new List<string>();
   while (targetIdx < target.Length && NameMatch(source[sourceIdx], target[targetIdx])) {
      matches[sourceIdx].Add(target[targetIdx]);
      targetIdx++;
   }
}

Upvotes: 1

pjp
pjp

Reputation: 17629

Well you obviously stop looping over the target list as soon as you go past the current source prefix. In this case you be better of with a prefix method than the matches one so you can tell what the current prefix is and stop searching the target if you go past it.

Upvotes: 0

Arpit Tambi
Arpit Tambi

Reputation: 1194

I think the best way would be to prepare an index. Like this (Javascript)

index = [];
index["1"] = [0,1,2];
index["2"] = [3,4];

Well sorted lists aren't really required in this case.

Upvotes: 0

Related Questions