bmdprasad
bmdprasad

Reputation: 65

Two huge List<string> data should be compared using C# and must ensure performance improves

I have two List<string> that contain huge data, and I have the following code which I used till now. But I have some doubts regarding this, due to lot of confusion about the data is compared or not in the list items. Here I am using sequence Equal to compare the data. I have two questions. Somewhere I found that SequenceEqual will compare the data in the lists. So used it.

  1. Will SequenceEqual compares the data in the lists?

  2. Better way of code to improve performance. As per understanding I kept only three items in both the lists but our requirement has huge data items in the lists. So need to improve performance.

    bool value = false;
    List<string> list1 = new List<string>();
    list1.Add("one");
    list1.Add("two");
    list1.Add("three");
    
    List<string> list2 = new List<string>();
    list2.Add("one");
    list2.Add("two");
    list2.Add("three");
    list1.Sort();
    list2.Sort();
    if (list1.SequenceEqual(list2))
    {
        value = true;
    }
    else
    {
        value = false;
    }
    return value;
    

Upvotes: 1

Views: 250

Answers (3)

Theodor Zoulias
Theodor Zoulias

Reputation: 43429

Your method of comparing the two lists is essentially this:

static bool SequenceUnorderedEqual<T>(
    IEnumerable<T> source1, IEnumerable<T> source2,
    IComparer<T> comparer = default)
{
    List<T> list1 = source1.ToList();
    List<T> list2 = source2.ToList();
    list1.Sort(comparer);
    list2.Sort(comparer);
    return list1.SequenceEqual(list2);
}

Using a dictionary is at least ten times faster, at the cost of being a bit more complex. It also doesn't require the items to be comparable (they only have to be equitable).

using System.Runtime.InteropServices;

//...

/// <summary>
/// Determines whether two sequences are equal according to an equality comparer,
/// ignoring the order of the items in the sequences.
/// </summary>
static bool SequenceUnorderedEqual<T>(
    IEnumerable<T> source1, IEnumerable<T> source2,
    IEqualityComparer<T> comparer = default)
{
    Dictionary<T, int> occurences;
    if (Enumerable.TryGetNonEnumeratedCount(source1, out int count1) &&
        Enumerable.TryGetNonEnumeratedCount(source2, out int count2))
    {
        if (count1 != count2) return false;
        occurences = new(count1, comparer);
    }
    else
    {
        occurences = new(comparer);
    }

    // Populating the dictionary using source1.
    foreach (T item in source1)
    {
        CollectionsMarshal
            .GetValueRefOrAddDefault(occurences, item, out _)++;
    }

    // Depopulating the dictionary using source2.
    foreach (T item in source2)
    {
        ref int count = ref CollectionsMarshal
            .GetValueRefOrNullRef(occurences, item);
        if (Unsafe.IsNullRef(ref count) || count == 0) return false;
        count--;
    }

    // Now all counters should be zero.
    foreach (int count in occurences.Values) if (count != 0) return false;
    return true;
}

A simpler but less efficient implementation that doesn't take advantage of the CollectionsMarshal APIs, can be found in the 2nd revision of this answer.

Upvotes: 1

Max
Max

Reputation: 504

As I understand it, using the SequenceEqual method from Linq will be very efficient. It effectively returns true if, quote, "the two source sequences are of equal length and their corresponding elements are equal according to the default equality comparer for their type", otherwise false. There won't really be a much faster way of doing this comparision.

Upvotes: 0

Enigmativity
Enigmativity

Reputation: 117027

Here is the implementation of SequenceEqual:

public static bool SequenceEqual<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
    if (comparer == null)
    {
        comparer = EqualityComparer<TSource>.Default;
    }
    if (first == null)
    {
        throw Error.ArgumentNull("first");
    }
    if (second == null)
    {
        throw Error.ArgumentNull("second");
    }
    using (IEnumerator<TSource> enumerator = first.GetEnumerator())
    {
        using (IEnumerator<TSource> enumerator2 = second.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                if (!enumerator2.MoveNext() || !comparer.Equals(enumerator.Current, enumerator2.Current))
                {
                    return false;
                }
            }
            if (enumerator2.MoveNext())
            {
                return false;
            }
        }
    }
    return true;
}

It does not check length and simply traverses both lists to confirm if they are equal.

If you check two lists that are 1_000_000 and 1_000_001 in length then it is terribly inefficient.

So check the lengths first, then call SequenceEqual only if they are the same.

If you're checking for set equality rather than position equality then make a HashSet<string> out of each and then check against those.

Upvotes: 1

Related Questions