Rand Random

Reputation: 7440

RegEx - Version in filename

I have the following schema for files that I want to parse with RegEx

[Custom/Random Name]_[MainVersion]_[MinorVersion].xls

Currently I have the following RegEx (which fails)

(?<firstPart>.+)_(?<mainVersion>\d+)(|_(?<minorVersion>\d+))\.xls

With this when the sample string is

Hello World_22_1.xls

it results in:

match.Groups["firstPart"].Value == "Hello World_22"
match.Groups["mainVersion"].Value == "1"
match.Groups["minorVersion"].Value == ""

but it should be

match.Groups["firstPart"].Value == "Hello World"
match.Groups["mainVersion"].Value == "22"
match.Groups["minorVersion"].Value == "1"

The problem is that my RegEx for "firstPart" allows any character with ".+" (which includes "_"), so it runs up to the last occurrence of "_". Therefore I could rewrite my RegEx like this

(?<firstPart>[^_]+)_(?<mainVersion>\d+)(|_(?<minorVersion>\d+))\.xls

But this RegEx will then fail if the fileName is this:

Hello_World_22_1.xls

Resulting in:

match.Groups["firstPart"].Value == "World"
match.Groups["mainVersion"].Value == "22"
match.Groups["minorVersion"].Value == "1"

Is there a way to validate the string backwards, since the thing I am looking for is always at the end of the fileName?

The RegEx should return the correct values for these strings (for simplicity I have written the desired result in parentheses as [firstPart]/[mainVersion]/[minorVersion])

Hello World_22_1.xls (Hello World/22/1)
Hello_World_22_1.xls (Hello_World/22/1)
Hello_World_22.xls (Hello_World/22/)
Hello_1_World_22_1.xls (Hello_1_World/22/1)
Hello_1_World_22.xls (Hello_1_World/22/)
Hello_33_2_World_22_1.xls (Hello_33_2_World/22/1)
Hello_22_1_World.xls (//) --> (I wouldn't mind if your solution returned Hello_22_1_World as firstPart)
33_22.xls (33/22/)
33_22_1.xls (33/22/1)

I played around with reversing the input string, but this "solution" is very questionable

using System;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine(TestRegEx("Hello World_22_1.xls", "Hello World", "22", "1"));
        Console.WriteLine(TestRegEx("Hello_World_22_1.xls", "Hello_World", "22", "1"));
        Console.WriteLine(TestRegEx("Hello_World_22.xls", "Hello_World", "22", ""));
        Console.WriteLine(TestRegEx("Hello_1_World_22_1.xls", "Hello_1_World", "22", "1"));
        Console.WriteLine(TestRegEx("Hello_1_World_22.xls", "Hello_1_World", "22", ""));
        Console.WriteLine(TestRegEx("Hello_33_2_World_22_1.xls", "Hello_33_2_World", "22", "1"));
        Console.WriteLine(TestRegEx("Hello_22_1_World.xls", "", "", ""));
        Console.WriteLine(TestRegEx("33_22.xls", "33", "22", ""));
        Console.WriteLine(TestRegEx("33_22_1.xls", "33", "22", "1"));

        Console.ReadLine();
    }

    private static bool TestRegEx(string str, string firstPart, string mainVersion, string minorVersion)
    {
        // The pattern is written mirrored because it is matched against the reversed file name.
        var regEx = new Regex("slx\\.((?<minorVersion>\\d+)_|)(?<mainVersion>\\d+)_(?<firstPart>.+)");
        var reverseStr = new string(str.Reverse().ToArray());

        var match = regEx.Match(reverseStr);
        // Reverse each captured group again to get it back in reading order.
        var x1 = new string(match.Groups["firstPart"].Value.Reverse().ToArray());
        var x2 = new string(match.Groups["mainVersion"].Value.Reverse().ToArray());
        var x3 = new string(match.Groups["minorVersion"].Value.Reverse().ToArray());

        return x1 == firstPart && x2 == mainVersion && x3 == minorVersion;
    }
}
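
As a possible alternative to reversing the string by hand, .NET has a RegexOptions.RightToLeft option that makes the engine scan from the end of the input. The following is only a minimal sketch under that assumption: the class name RightToLeftDemo and the helper Parse are made up for illustration, and the pattern is the original greedy one with ^ and $ anchors added and the optional part written as a non-capturing group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RightToLeftDemo
{
    // Same shape as the original greedy pattern, anchored at both ends.
    // With RegexOptions.RightToLeft the engine starts at the end of the input,
    // so the version suffix is consumed before the greedy firstPart group.
    private static readonly Regex FileNameRegex = new Regex(
        @"^(?<firstPart>.+)_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?\.xls$",
        RegexOptions.RightToLeft);

    // Hypothetical helper: returns "firstPart/mainVersion/minorVersion",
    // or "//" when the file name does not follow the schema.
    public static string Parse(string fileName)
    {
        Match match = FileNameRegex.Match(fileName);
        if (!match.Success)
        {
            return "//";
        }

        return match.Groups["firstPart"].Value + "/"
             + match.Groups["mainVersion"].Value + "/"
             + match.Groups["minorVersion"].Value;
    }

    public static void Main()
    {
        string[] names =
        {
            "Hello World_22_1.xls",
            "Hello_World_22.xls",
            "Hello_33_2_World_22_1.xls",
            "Hello_22_1_World.xls",
            "33_22.xls"
        };

        foreach (string name in names)
        {
            Console.WriteLine(name + " -> " + Parse(name));
        }
    }
}
```

The captured values come back in normal reading order, so no second reversal is needed.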

Upvotes: 2

Views: 114

Answers (2)

Wiktor Stribiżew

Reputation: 626929

The main trouble is the greedy dot pattern at the beginning: it grabs the whole input first, and backtracking then gives back only as much as the rest of the pattern needs, so only the last digits are captured. To be able to use optional groups and get their contents when they are present, you need to use a lazy quantifier with the dot pattern.
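
To make the greedy versus lazy difference concrete, here is a small sketch that runs the question's original pattern and the same pattern with only the quantifier changed to +? against the question's sample string. The class name GreedyVsLazyDemo and the helper FirstPart are illustrative names, not part of the question's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // The question's original pattern: greedy .+ in firstPart.
    public const string Greedy =
        @"(?<firstPart>.+)_(?<mainVersion>\d+)(|_(?<minorVersion>\d+))\.xls";

    // Identical except for the lazy .+? in firstPart.
    public const string Lazy =
        @"(?<firstPart>.+?)_(?<mainVersion>\d+)(|_(?<minorVersion>\d+))\.xls";

    // Hypothetical helper: returns what the firstPart group captured.
    public static string FirstPart(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups["firstPart"].Value;
    }

    public static void Main()
    {
        const string input = "Hello World_22_1.xls";

        // Greedy: firstPart swallows "_22" and leaves only "1" for mainVersion.
        Console.WriteLine("greedy: " + FirstPart(Greedy, input));

        // Lazy: firstPart stops at the first position where the rest can match.
        Console.WriteLine("lazy:   " + FirstPart(Lazy, input));
    }
}
```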

I suggest using

(?<firstPart>.+?)(?:_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?)?\.xls

See the regex demo

Details:

  • (?<firstPart>.+?) - Group "firstPart" that matches any 1+ chars other than a newline, as few as possible due to the lazy +? quantifier
  • (?:_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?)? - 1 or 0 occurrences of:
    • _(?<mainVersion>\d+) - a _ and the "mainVersion" group capturing 1 or more digits
    • (?:_(?<minorVersion>\d+))? - an optional sequence of
      • _ - an underscore
      • (?<minorVersion>\d+) - a "minorVersion" group capturing 1+ digits
  • \.xls - a .xls substring.


I'd prefer this to the (?<firstPart>.+?)_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?\.xls regex because the latter won't match Hello_22_1_World.xls at all. If you do not need to match it, this last expression may be preferable.
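
The difference between the two variants can be checked with a short sketch like the one below. The class name VersionPatternDemo, the constant names and the helper Describe are assumptions made for the example; the two patterns are the ones discussed in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionPatternDemo
{
    // Variant where the whole version suffix is optional.
    public const string OptionalSuffix =
        @"(?<firstPart>.+?)(?:_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?)?\.xls";

    // Variant that requires at least _<mainVersion> before .xls.
    public const string RequiredMain =
        @"(?<firstPart>.+?)_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?\.xls";

    // Hypothetical helper: "firstPart/mainVersion/minorVersion" or "no match".
    public static string Describe(string pattern, string fileName)
    {
        Match m = Regex.Match(fileName, pattern);
        if (!m.Success)
        {
            return "no match";
        }

        return string.Join("/",
            m.Groups["firstPart"].Value,
            m.Groups["mainVersion"].Value,
            m.Groups["minorVersion"].Value);
    }

    public static void Main()
    {
        string[] names =
        {
            "Hello_World_22.xls",
            "Hello_33_2_World_22_1.xls",
            "Hello_22_1_World.xls"
        };

        foreach (string name in names)
        {
            Console.WriteLine(name);
            Console.WriteLine("  optional suffix: " + Describe(OptionalSuffix, name));
            Console.WriteLine("  required main:   " + Describe(RequiredMain, name));
        }
    }
}
```

Only the last file name should behave differently: the first variant reports the whole name as firstPart, while the second finds no match.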

Upvotes: 2

Ashkan Mobayen Khiabani

Reputation: 34160

Use this:

^(?<firstPart>.+?)_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?\.xls$

Here is the DEMO

Upvotes: 0
