reach4thelasers
reach4thelasers

Reputation: 26899

Regex Style Pattern Matching in .NET Object Lists

I've got a collection of .NET objects with various properties. Lets say its a chain of Chromosomes in a genetic code - although the objects data is a little more complex than that. I want to search the list for predefined sequences of objects. I can define objects as a finite number of unique types of interest. R,B,D and in a massive list I want to find certain sequences of objects:

A massively simplified version would be:

public class Chromosome {
    public ChromosomeType CromosomeType { 
       get {
        // Some logic that works out and returns the correct chromosome type

       }
    }
}

public enum ChromosomeType {
  R, B, D
}

So given a large collection of these types. I want to match certain sequences

e.g. "R+B{3}D+"

So in the "regex" above, the following subsequence would be matched in a list: RRRBBBDD

I need to be able to return all matches from a very long list of Objects.

Clearly regex is perfect for this, but I don't actually have strings, I've got collections of objects.

Whats the best way to search a collection of objects for predefined sequences?

Update

Colin's solution is the one I went with in the end. It works great. I updated it to be able to handle multiple matches, and to use Arrays in order to be as fast as possible

Here's the final working solution:

    public static class ChromosomesExtensions
    {
        public static IEnumerable<Chromosome[]> FindBySequence(this Chromosome[] chromosomes, string patternRegex)
        {
            var sequenceString
                = String.Join(
                    String.Empty, //no separator
                    (
                        from c in chromosomes
                        select c.CromosomeType.ToString()
                    )
                );
            MatchCollection matches = Regex.Matches(sequenceString, patternRegex);

            foreach (Match match in matches)
            {
                Chromosome[] subset = new Chromosome[match.Value.Length];

                var j = 0;
                for (var i = match.Index; i < match.Index + match.Length; i++)
                {
                    subset[j++] = chromosomes[i];
                }
                yield return subset;
            }
        }
    }

    [TestFixture]
    public class TestClass
    {
        [Test]
        public void TestMethod()
        {
            var chromosomes =
                new[]
                {
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 1},
                    new Chromosome(){ CromosomeType = ChromosomeType.R, Id = 2 },
                    new Chromosome(){ CromosomeType = ChromosomeType.R, Id = 3 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 4 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 5 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 6 },
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 7 },
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 8 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 9 },
                    new Chromosome(){ CromosomeType = ChromosomeType.R, Id = 10 },
                    new Chromosome(){ CromosomeType = ChromosomeType.R, Id = 11 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 12 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 13 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 14 },
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 15 },
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 16 },
                    new Chromosome(){ CromosomeType = ChromosomeType.R, Id = 17 },
                    new Chromosome(){ CromosomeType = ChromosomeType.R, Id = 18 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 19 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 20 },
                    new Chromosome(){ CromosomeType = ChromosomeType.B, Id = 21 },
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 22 },
                    new Chromosome(){ CromosomeType = ChromosomeType.D, Id = 23 },
                };

            var matchIndex = 0;
            foreach (Chromosome[] match in chromosomes.FindBySequence("R+B{3}D+"))
            {
                Console.WriteLine($"Match {++matchIndex}");
                var result = new String(match.SelectMany(x => string.Join("", $"id: {x.Id} Type: {x.CromosomeType.ToString()}\n")).ToArray());
                Console.WriteLine(result);
            }

        }
    }

Output:

    Match 1
id: 2 Type: R
id: 3 Type: R
id: 4 Type: B
id: 5 Type: B
id: 6 Type: B
id: 7 Type: D
id: 8 Type: D

Match 2
id: 10 Type: R
id: 11 Type: R
id: 12 Type: B
id: 13 Type: B
id: 14 Type: B
id: 15 Type: D
id: 16 Type: D

Match 3
id: 17 Type: R
id: 18 Type: R
id: 19 Type: B
id: 20 Type: B
id: 21 Type: B
id: 22 Type: D
id: 23 Type: D

Upvotes: 2

Views: 552

Answers (3)

eocron
eocron

Reputation: 7526

Maybe too late, but I have to answer. I implemented ORegex engine (not for your problem, but very close to yours). It's use is pretty much like you use Regex and will perfectly solve your problem with pattern matching on arbitary colletion. It is much faster than method above, and you can even pass some very complex condition fucntions (for example, check some object properties too). It is totaly free and available through Nuget.

Example with prime sequences: How to search patterns in arbitrary sequences?

Project page with more examples:https://github.com/eocron/ORegex

Upvotes: 1

James Fegan
James Fegan

Reputation: 129

Colin's answer seems to get you close to where you want to be. I have two thoughts to add:

  1. Do you really need to pull in "RegEx" to accomplish the task? You're using a subset of the RegEx library for the quantifiers, but that is at the expense of adding a dependency to a complex tool. You may have a more portable application if you just make your own simple (albeit less flexible) syntax.

  2. I would consider avoiding ToString, and simply give your objects a const string property that you could use to bounce the RegEx off of. If you're dealing with 'massive' amounts of data, it seems like calling ToString() everywhere would give you quite a bit of overhead.

Upvotes: 1

Colin
Colin

Reputation: 4135

A simple, clean way using extension methods (that actually supports searching via Regex).

Classes:

public static class ChromosomesExtensions
{
    public static IEnumerable<Chromosome> FindBySequence(this IEnumerable<Chromosome> chromosomes, string patternRegex)
    {
        var sequenceString
            = String.Join(
                String.Empty, //no separator
                (
                    from c in chromosomes
                    select c.CromosomeType.ToString()
                )
            );
        var match = Regex.Match(sequenceString, patternRegex);
        //returns empty if no match is found
        return chromosomes.ToList().GetRange(sequenceString.IndexOf(match.Value), match.Value.Length);
    }
}

Usage:

var chromosomes =
    new[]
    {
        new Chromosome(){ CromosomeType = ChromosomeType.D },
        new Chromosome(){ CromosomeType = ChromosomeType.R },
        new Chromosome(){ CromosomeType = ChromosomeType.R },
        new Chromosome(){ CromosomeType = ChromosomeType.B },
        new Chromosome(){ CromosomeType = ChromosomeType.B },
        new Chromosome(){ CromosomeType = ChromosomeType.B },
        new Chromosome(){ CromosomeType = ChromosomeType.D },
        new Chromosome(){ CromosomeType = ChromosomeType.D },
        new Chromosome(){ CromosomeType = ChromosomeType.B },
    };

var queryResult = chromosomes.FindBySequence("R+B{3}D+");

Upvotes: 2

Related Questions