Reputation: 7440
I have the following schema for files that I want to parse with RegEx
[Custom/Random Name]_[MainVersion]_[MinorVersion].xls
Currently I have the following RegEx (which fails)
(?<firstPart>.+)_(?<mainVersion>\d+)(|_(?<minorVersion>\d+))\.xls
With this when the sample string is
Hello World_22_1.xls
it results in:
match.Groups["firstPart"].Value == "Hello World_22"
match.Groups["mainVersion"].Value == "1"
match.Groups["minorVersion"].Value == ""
but it should be
match.Groups["firstPart"].Value == "Hello World"
match.Groups["mainVersion"].Value == "22"
match.Groups["minorVersion"].Value == "1"
The problem is that my RegEx for "firstPart" allows any character with ".+" (which includes "_"), so it runs on to the last occurrence of "_". Therefore I could rewrite my RegEx like this:
(?<firstPart>[^_]+)_(?<mainVersion>\d+)(|_(?<minorVersion>\d+))\.xls
But this RegEx will then fail if the file name is this:
Hello_World_22_1.xls
Resulting in:
match.Groups["firstPart"].Value == "World"
match.Groups["mainVersion"].Value == "22"
match.Groups["minorVersion"].Value == "1"
Is there a way to validate the string backwards, since the thing I am looking for is always at the end of the fileName?
The RegEx should return the correct values for these strings (for simplicity I have written the desired result in parentheses as [firstPart]/[mainVersion]/[minorVersion]):
Hello World_22_1.xls (Hello World/22/1)
Hello_World_22_1.xls (Hello_World/22/1)
Hello_World_22.xls (Hello_World/22/)
Hello_1_World_22_1.xls (Hello_1_World/22/1)
Hello_1_World_22.xls (Hello_1_World/22/)
Hello_33_2_World_22_1.xls (Hello_33_2_World/22/1)
Hello_22_1_World.xls (//) --> (I wouldn't mind if your solution returned Hello_22_1_World as firstPart)
33_22.xls (33/22/)
33_22_1.xls (33/22/1)
I played around with reversing the input string, but this "solution" is very questionable:
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            // Each call prints True when all three groups match the expected values.
            Console.WriteLine(TestRegEx("Hello World_22_1.xls", "Hello World", "22", "1"));
            Console.WriteLine(TestRegEx("Hello_World_22_1.xls", "Hello_World", "22", "1"));
            Console.WriteLine(TestRegEx("Hello_World_22.xls", "Hello_World", "22", ""));
            Console.WriteLine(TestRegEx("Hello_1_World_22_1.xls", "Hello_1_World", "22", "1"));
            Console.WriteLine(TestRegEx("Hello_1_World_22.xls", "Hello_1_World", "22", ""));
            Console.WriteLine(TestRegEx("Hello_33_2_World_22_1.xls", "Hello_33_2_World", "22", "1"));
            Console.WriteLine(TestRegEx("Hello_22_1_World.xls", "", "", ""));
            Console.WriteLine(TestRegEx("33_22.xls", "33", "22", ""));
            Console.WriteLine(TestRegEx("33_22_1.xls", "33", "22", "1"));
            Console.ReadLine();
        }

        private static bool TestRegEx(string str, string firstPart, string mainVersion, string minorVersion)
        {
            // The pattern is written backwards because it is applied to the reversed file name.
            var regEx = new Regex("slx\\.((?<minorVersion>\\d+)_|)(?<mainVersion>\\d+)_(?<firstPart>.+)");
            var reverseStr = new string(str.Reverse().ToArray());
            var match = regEx.Match(reverseStr);
            // Reverse each captured group again to restore the original character order.
            var x1 = new string(match.Groups["firstPart"].Value.Reverse().ToArray());
            var x2 = new string(match.Groups["mainVersion"].Value.Reverse().ToArray());
            var x3 = new string(match.Groups["minorVersion"].Value.Reverse().ToArray());
            return x1 == firstPart && x2 == mainVersion && x3 == minorVersion;
        }
    }
Upvotes: 2
Views: 114
Reputation: 626929
The main trouble is certainly the greedy dot pattern at the beginning: it grabs the whole input at first, and backtracking then gives back only the last digits. To be able to use optional groups and get their contents when they are present, you need to use a lazy quantifier with the dot pattern.
I suggest using
(?<firstPart>.+?)(?:_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?)?\.xls
See the regex demo
Details:
- (?<firstPart>.+?) - Group "firstPart" that matches 1 or more chars, as few as possible due to the lazy +? quantifier
- (?:_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?)? - 1 or 0 occurrences of:
  - _(?<mainVersion>\d+) - a _ and the "mainVersion" group capturing 1 or more digits
  - (?:_(?<minorVersion>\d+))? - an optional sequence of:
    - _ - an underscore
    - (?<minorVersion>\d+) - a "minorVersion" group capturing 1+ digits
- \.xls - a .xls substring.

I'd prefer this to the (?<firstPart>.+?)_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?\.xls regex because the latter won't match Hello_22_1_World.xls at all. If you do not need to match it, this last expression may be preferable.
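As a quick check, here is a minimal sketch that runs the suggested pattern against a few of the file names from the question. The class name `VersionedFileName` and the helper `Parse` are made up for this illustration; only `Regex`, `Match` and `Groups` come from .NET. A named group that did not take part in the match has an empty `Value`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersionedFileName
{
    // Pattern quoted from the answer; the verbatim string avoids double escaping.
    private static readonly Regex Pattern = new Regex(
        @"(?<firstPart>.+?)(?:_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?)?\.xls");

    // Hypothetical helper: formats the groups as "firstPart/mainVersion/minorVersion".
    public static string Parse(string fileName)
    {
        Match m = Pattern.Match(fileName);
        if (!m.Success)
        {
            return "//";
        }
        return m.Groups["firstPart"].Value + "/"
             + m.Groups["mainVersion"].Value + "/"
             + m.Groups["minorVersion"].Value;
    }

    public static void Main()
    {
        string[] samples =
        {
            "Hello World_22_1.xls",
            "Hello_World_22.xls",
            "Hello_33_2_World_22_1.xls",
            "Hello_22_1_World.xls",
            "33_22_1.xls"
        };
        foreach (string sample in samples)
        {
            Console.WriteLine(sample + " -> " + Parse(sample));
        }
    }
}
```

For Hello_22_1_World.xls this yields Hello_22_1_World as firstPart with both version groups empty, which the question says is acceptable.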
Upvotes: 2
Reputation: 34160
Use this:
^(?<firstPart>.+?)_(?<mainVersion>\d+)_(?<minorVersion>\d+)\.xls$
Here is the DEMO
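A small sketch of how this anchored pattern behaves on names from the question. As written it requires both version numbers, so names without a minor version do not match. The `OptionalMinor` constant is an assumed variant added here for comparison and is not part of the suggestion above; the class name `AnchoredPatternCheck` is likewise made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchoredPatternCheck
{
    // Pattern as quoted above: mainVersion and minorVersion are both required.
    public const string Strict =
        @"^(?<firstPart>.+?)_(?<mainVersion>\d+)_(?<minorVersion>\d+)\.xls$";

    // Assumed variant (not from the original post): minorVersion made optional.
    public const string OptionalMinor =
        @"^(?<firstPart>.+?)_(?<mainVersion>\d+)(?:_(?<minorVersion>\d+))?\.xls$";

    public static void Main()
    {
        string[] samples = { "Hello_1_World_22_1.xls", "Hello_World_22.xls", "33_22.xls" };
        foreach (string sample in samples)
        {
            // Regex.IsMatch reports only whether the pattern matches the input.
            Console.WriteLine(sample
                + ": strict=" + Regex.IsMatch(sample, Strict)
                + ", optionalMinor=" + Regex.IsMatch(sample, OptionalMinor));
        }
    }
}
```

Neither pattern matches Hello_22_1_World.xls, since both require at least a trailing _[MainVersion] before the extension.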
Upvotes: 0