Reputation: 2027

Find Strings with certain Hamming distance LINQ

If we run the following (thanks to @octavioccl for help) LINQ Query:

var result = stringsList
.GroupBy(s => s)
  .Where(g => g.Count() > 1)        
  .OrderByDescending(g => g.Count())
  .Select(g => g.Key);

It gives us all the strings which occur in the list atleast twice (but exactly matched i.e. Hamming Distance =0).

I was just wondering if there is an elegant solution (all solutions I have tried so far either use loops and a counter which is ugly or regex) possible where we can specify the hamming distance in the Where clause to get those strings as well which lie within the specified Hamming Distance range?

P.S: All the strings are of equal length

UPDATE

Really thanks to krontogiannis for his detailed answer. As I mentioned earlier, I want to get list of strings with hamming distance below the given threshold. His code is working perfectly fine for it (Thanks again).

Only thing remaining is to take the strings out of the 'resultset' and insert/add into a `List'

Basically this is what I want:

List<string> outputList = new List<string>();
foreach (string str in patternsList)
            {
                var rs = wordsList
    .GroupBy(w => hamming(w, str))
    .Where(h => h.Key <= hammingThreshold)
    .OrderByDescending(h => h.Key)
    .Select(h => h.Count());
outputList.Add(rs); //I know it won't work but just to show what is needed
            }

Thanks

Upvotes: 2

Answers (4)

Hakan Fıstık

Reputation: 19421

Install System.Numerics.Tensors Package (by Microsoft).

Call it

using System.Numerics.Tensors;

ReadOnlySpan<byte> byte1 = [1, 2, 3, 4, 5];
ReadOnlySpan<byte> byte2 = [1, 2, 3, 4, 5];
int hammingDistance = TensorPrimitives.HammingDistance(byte1, byte2);

Console.WriteLine(hammingDistance); // 0

Here is the documentation page.
Here is the source code.
This library uses SIMD so the performance gain could be huge, especially for large arrays.

Upvotes: 1

z3nth10n

Reputation: 2441

OP asked for Hamming distance, my algorithm uses Levenshtein distance algorithm. But the code is easily transformable.

namespace Program
{
    public static class Utils
    {
        public static string LongestCommonSubstring(this IEnumerable<string> arr)
        {
            // Determine size of the array
            var n = arr.Count();

            // Take first word from array as reference
            var s = arr.ElementAt(0);
            var len = s.Length;

            var res = "";

            for (var i = 0; i < len; i++)
            {
                for (var j = i + 1; j <= len; j++)
                {
                    // generating all possible substrings
                    // of our reference string arr[0] i.e s
                    var stem = s.Substring(i, j - i);
                    var k = 1;
                    //for (k = 1; k < n; k++) {
                    foreach (var item in arr.Skip(1))
                    {
                        // Check if the generated stem is
                        // common to all words
                        if (!item.Contains(stem))
                            break;

                        ++k;
                    }

                    // If current substring is present in
                    // all strings and its length is greater
                    // than current result
                    if (k == n && res.Length < stem.Length)
                        res = stem;
                }
            }

            return res;
        }

        public static HashSet<string> GetShortestGroupedString(this HashSet<string> items, int distanceThreshold = 3, int minimumStringLength = 2)
        {
            var cluster = new Dictionary<int, List<Tuple<string, string>>>();
            var clusterGroups = new HashSet<string>();

            var itemCount = items.Count * items.Count;
            int k = 0;

            var first = items.First();
            var added = "";
            foreach (var item in items)
            //Parallel.ForEach(merged, item => // TODO
            {
                var computed2 = new List<string>();
                foreach (var item2 in items)
                {
                    var distance = LevenshteinDistance.Compute(item, item2);
                    var firstDistance = LevenshteinDistance.Compute(first, item2);

                    if (!cluster.ContainsKey(distance)) // TODO: check false
                        cluster.Add(distance, new List<Tuple<string, string>>());

                    if (distance > distanceThreshold)
                    {
                        ++k;
                        continue;
                    }

                    cluster[distance].Add(new Tuple<string, string>(item, item2));

                    if (firstDistance > distance)
                    {
                        var computed = new List<string>();
                        foreach (var kv in cluster)
                        {
                            if (kv.Value.Count == 0) continue;
                            var longest = kv.Value.Select(dd => dd.Item1).LongestCommonSubstring();
                            if (string.IsNullOrEmpty(longest)) continue;

                            computed.Add(longest);
                        }

                        var currentAdded = computed.OrderBy(s => s.Length).FirstOrDefault();
                        var diff = string.IsNullOrEmpty(added) || string.IsNullOrEmpty(currentAdded)
                            ? string.Empty
                            : currentAdded.Replace(added, string.Empty);

                        if (!string.IsNullOrEmpty(currentAdded) && diff.Length == currentAdded.Length)
                        {
                            var ff = computed2.OrderBy(s => s.Length).FirstOrDefault();
                            if (ff.Length >= minimumStringLength)
                                clusterGroups.Add(ff);

                            computed2.Clear(); // TODO: check false
                            computed2.Add(diff);
                        }
                        else
                        {
                            if (diff.Length == 0 && !string.IsNullOrEmpty(added) && !string.IsNullOrEmpty(currentAdded))
                                computed2.Add(diff);
                        }

                        added = currentAdded;
                        cluster.Clear();
                        first = item;
                    }

                    ++k;
                }

                var f = computed2.OrderBy(s => s.Length).FirstOrDefault();
                if (f.Length >= minimumStringLength)
                    clusterGroups.Add(f);
            }
            //});

            return clusterGroups;
        }
    }

    /// <summary>
    /// Contains approximate string matching
    /// </summary>
    internal static class LevenshteinDistance
    {
        /// <summary>
        /// Compute the distance between two strings.
        /// </summary>
        public static int Compute(string s, string t)
        {
            var n = s.Length;
            var m = t.Length;
            var d = new int[n + 1, m + 1];

            // Step 1
            if (n == 0)
            {
                return m;
            }

            if (m == 0)
            {
                return n;
            }

            // Step 2
            for (var i = 0; i <= n; d[i, 0] = i++)
            {
            }

            for (var j = 0; j <= m; d[0, j] = j++)
            {
            }

            // Step 3
            for (var i = 1; i <= n; i++)
            {
                //Step 4
                for (var j = 1; j <= m; j++)
                {
                    // Step 5
                    var cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                    // Step 6
                    d[i, j] = Math.Min(
                        Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                        d[i - 1, j - 1] + cost);
                }
            }
            // Step 7
            return d[n, m];
        }
    }
}

The code has two references:

The LevenshteinDistance class was extracted from: https://stackoverflow.com/a/2344347/3286975
The LongestCommonString method was extracted from: https://www.geeksforgeeks.org/longest-common-substring-array-strings/

My code is being reviewed at https://codereview.stackexchange.com/questions/272379/get-shortest-grouped-distinct-string-from-hashset-of-strings so I expect improvements on it.

Upvotes: 1

krontogiannis

Reputation: 1939

Calculating the hamming distance between two strings using LINQ can be done in an elegant way:

Func<string, string, int> hamming = (s1, s2) => s1.Zip(s2, (l, r) => l - r == 0 ? 0 : 1).Sum();

You question is a bit vague about the 'grouping'. As you can see to calculate the hamming distance you need two strings. So you either need to calculate the hamming distance for all the words in your string list vs an input, or calculate the distance between all for the words in your list (or something different that you need to tell us :-) ).

In any way i'll give two examples for input

var words = new[] {
    "hello",
    "rellp",
    "holla",
    "fooba",
    "hempd"
};

Case 1

var input = "hello";
var hammingThreshold = 3;

var rs = words
    .GroupBy(w => hamming(w, input))
    .Where(h => h.Key <= hammingThreshold)
    .OrderByDescending(h => h.Key);

Output would be something like

hempd with distance 3
rellp holla with distance 2
hello with distance 0

Case 2

var hs = words
    .SelectMany((w1, i) => 
        words
            .Where((w2, j) => i > j)
            .Select(w2 => new { Word1 = w1, Word2 = w2 })) // all word pairs except with self
    .GroupBy(pair => hamming(pair.Word1, pair.Word2))
    .Where(g => g.Key <= hammingThreshold)
    .OrderByDescending(g => g.Key);

Output would be something like

(holla, rellp) (fooba, holla) (hempd, hello) with distance 3
(rellp, hello) (holla, hello) with distance 2

Edit To get only the words from the first grouping you can use SelectMany

var output = rs.SelectMany(g => g).ToList();

Upvotes: 4

P. Weyer

Reputation: 1

You could do something like this:

int hammingDistance = 2;
var result = stringsList
.GroupBy(s => s.Substring(0, s.Length - hammingDistance))
.Where(g => g.Count() > 1)
.OrderbyDescending(g => g.Count())
.Select(g => g.Key);

Upvotes: -1

Find Strings with certain Hamming distance LINQ

Answers (4)

Related Questions