Jerry

Reputation: 4408

Regular expression to match consecutive numbers

I am trying to split the following string into parts whose numbers are consecutive:

word 7, word 8, word 9, word 14

So I get:

word 7, word 8, word 9
word 14

using regular expressions. So far I match the string with (word (?<num>\d+),?\s*)+ and then check the number in each capture in code.

Is it possible to have a regular expression to directly extract only parts with consecutive numbers?
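
For reference, the approach described above (match with the pattern, then compare the captured numbers in code) might look like the following minimal sketch. The names ConsecutiveRuns and SplitRuns are hypothetical, and the sketch assumes the whole input is covered by a single match of the pattern; only the first match is examined.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConsecutiveRuns
{
    // Pattern from the question: group 1 captures each "word N" item,
    // the named group "num" captures N.
    private static readonly Regex Pattern = new Regex(@"(word (?<num>\d+),?\s*)+");

    // Hypothetical helper: returns one string per run of consecutive numbers.
    public static List<string> SplitRuns(string input)
    {
        var runs = new List<string>();
        Match match = Pattern.Match(input);
        if (!match.Success) return runs;

        // A repeated group keeps every iteration in its Captures collection,
        // so items[i] and nums[i] refer to the same "word N" item.
        CaptureCollection items = match.Groups[1].Captures;
        CaptureCollection nums = match.Groups["num"].Captures;

        var current = new List<string>();
        int? previous = null;
        for (int i = 0; i < nums.Count; i++)
        {
            int value = int.Parse(nums[i].Value);
            // The arithmetic check happens here in code, not in the regex.
            if (previous.HasValue && value != previous.Value + 1)
            {
                runs.Add(string.Join(", ", current));
                current.Clear();
            }
            // Strip the trailing separator that the pattern includes in each item.
            current.Add(items[i].Value.Trim().TrimEnd(','));
            previous = value;
        }
        runs.Add(string.Join(", ", current));
        return runs;
    }
}
```

Calling ConsecutiveRuns.SplitRuns("word 7, word 8, word 9, word 14") yields the two strings shown above.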

Upvotes: 0

Views: 941

Answers (4)

Tom Blodget

Reputation: 20772

LINQ is pretty handy for sequences of all kinds. It has many useful operators but you can also define your own. Here's how you could use it:

   "word 10, word 11, word 7, word 8, word 9, word 14, word 2"
        .Split( new [] {", "}, StringSplitOptions.RemoveEmptyEntries)
        .ToPartitionsOfConsecutiveValues(w => Int32.Parse(w.Split(' ').Last()))
        .Select(sequence => String.Join(", ", sequence))
        .ToArray()
        .Dump("Array of strings");

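Dump is a LINQPad extension method that prints the result. For this input the result is four strings: "word 10, word 11", "word 7, word 8, word 9", "word 14" and "word 2".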

Here's the new operator:

public static class Partition {

    public static IEnumerable<List<T>> ToPartitionsOfConsecutiveValues<T>(
        this IEnumerable<T> source, 
        Func<T,int> valueSelector)
    {
        var lastValue = (int?)null;
        List<T> lastList = null;    
        foreach (var item in source) 
        {
            var value = valueSelector(item);
            if (!(lastValue.HasValue)) 
            {
                lastList = new List<T>();
            }
            else if (lastValue.Value != value - 1)            
            {
                yield return lastList;
                lastList = new List<T>();
            }
            lastValue = value;
            lastList.Add(item);
        }
        if (lastValue.HasValue) yield return lastList;
    }
}

Update based on comment by @L.B.

LINQ operators are most useful when they depend on as few concrete types as possible. Moving the logic that depends on the value type (int) out of the operator and into a predicate lets the operator be reused in other cases.

Here is the same example:

Func<String,Int32> IntSuffix = w => Int32.Parse(w.Split(' ').Last());
Func<String, String, Boolean> breakPredicate 
    = (prev, next) => IntSuffix(prev) != IntSuffix(next) - 1;
// s is the input string, e.g. the one from the first example
var partitions = s.Split( new [] {", "}, StringSplitOptions.RemoveEmptyEntries)
    .ToPartitionsOfSequences(breakPredicate)
    .Select (sequence => String.Join(", ", sequence));

The implementation (as an extension method, it must be declared inside a static class such as Partition above):

public static IEnumerable<List<T>> ToPartitionsOfSequences<T>(
    this IEnumerable<T> source, 
    Func<T, T, Boolean> breakPredicate)
{
    T lastItem = default(T);
    List<T> lastList = null;    
    foreach (var item in source) 
    {
        if (lastList == null) 
        {
            lastList = new List<T>();
        }
        else if (breakPredicate(lastItem, item))
        {
            yield return lastList;
            lastList = new List<T>();
        }
        lastItem = item;
        lastList.Add(item);
    }
    if (lastList != null) yield return lastList;
}
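
To illustrate the point about reuse, here is a minimal sketch that applies the same operator to a different item type (char instead of strings with an integer suffix). The class PartitionDemo and the helper LetterRuns are hypothetical names introduced for this example; the operator is copied from above so that the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionDemo
{
    // Copy of the operator above, placed in a static class so it is a valid extension method.
    public static IEnumerable<List<T>> ToPartitionsOfSequences<T>(
        this IEnumerable<T> source,
        Func<T, T, bool> breakPredicate)
    {
        T lastItem = default(T);
        List<T> lastList = null;
        foreach (var item in source)
        {
            if (lastList == null)
            {
                lastList = new List<T>();
            }
            else if (breakPredicate(lastItem, item))
            {
                yield return lastList;
                lastList = new List<T>();
            }
            lastItem = item;
            lastList.Add(item);
        }
        if (lastList != null) yield return lastList;
    }

    // Hypothetical helper: splits a string into runs of alphabetically adjacent characters.
    public static string[] LetterRuns(string letters)
    {
        return letters
            .ToPartitionsOfSequences((prev, next) => next != prev + 1)
            .Select(run => new string(run.ToArray()))
            .ToArray();
    }
}
```

With this sketch, PartitionDemo.LetterRuns("abcxyq") returns "abc", "xy" and "q"; only the break predicate changes between use cases, not the operator.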

Upvotes: 1

Guilherme Agostinelli

Reputation: 1582

Alternatively, you could use:

        string words = "word 7, word 8, word 9, word 14";
        string[] splittedWords = Regex.Split(words, ", "); //Separating words.

        List<string> sortedWords = new List<string>();

        int currentWordNumber = 0, lastWordNumber = 0;
        foreach (string sptw in splittedWords)
        {
            if (sortedWords.Count == 0) //No value has been written to the list yet, so:
            {
                sortedWords.Add(sptw);
                lastWordNumber = int.Parse(sptw.Split(' ')[1]); //Storing the number of the word for checking it later.
            }
            else
            {
                currentWordNumber = int.Parse(sptw.Split(' ')[1]);

                if (currentWordNumber == lastWordNumber + 1)
                    sortedWords[sortedWords.Count - 1] += ", " + sptw;
                else
                    sortedWords.Add(sptw);

                lastWordNumber = currentWordNumber; //Storing the number of the word for checking it later.
            }
        }

At the end, the list sortedWords will have:

"word 7, word 8, word 9"
"word 14"

Upvotes: 1

MisterMetaphor

Reputation: 6008

This is not possible using only regular expressions, because regular expressions can only describe regular languages.

Regular languages, among other limitations, cannot carry context from one part of the input to the next. In your case that context would be the most recently matched number, which the next number has to be compared against.

For more info on language and grammar theory, see the Chomsky hierarchy.

Upvotes: 1

Brad Rem

Reputation: 6026

Since non-RegEx solutions are acceptable:

var data = "word 7, word 8, word 9, word 14";

// split the data into word and number
var dataCollection = data.Split(',').Select (d => new 
{ 
    word = d.Trim().Split(' ')[0], 
    number = int.Parse(d.Trim().Split(' ')[1]) 
}).ToList();

// store each set of consecutive results into a collection
List<string> resultsCollection = new List<string>();
var sb = new StringBuilder();
int i = 0;
while (i < dataCollection.Count)
{
    if(i > 0)
    {
       if(dataCollection[i].number == dataCollection[i-1].number + 1)
       {
           if(sb.Length > 0) sb.Append(", ");
       }
       else
       {
          resultsCollection.Add(sb.ToString());
          sb.Clear();
       }
    }
    sb.AppendFormat("{0} {1}", dataCollection[i].word, dataCollection[i].number);
    i++;
}
resultsCollection.Add(sb.ToString());

For your test data, resultsCollection will contain two items:

word 7, word 8, word 9

word 14

Upvotes: 1
