Reputation: 12803
I have filenames which can have an arbitrary number of words/spaces in them. Basically, I need the right syntax to consume any characters in the middle of the string without consuming the last word.
Some background: the first word or the last word could be a date that I need to capture. Alternatively, the last word may be initials. I need the date/initials in named capture groups.
Example files:
FileName                              Expected Capture Groups
--------                              -----------------------
Myfile 120101.xls                     Date: {Myfile, 120101}
120101 MyFile.xls                     Date: {MyFile, 120101}
MyFile BHO.doc                        Date: {MyFile} Initials: {BHO}
120101 My file name BHO.docx          Date: {120101} Initials: {BHO}
Foo.bar                               None
WhyDidIUsePeriods.huh.doc             None
120101 WhyDidIUsePeriods.huh.doc      Date: {WhyDidIUsePeriods, 120101}
WhyDidIUsePeriods BHO.huh.doc         Date: {WhyDidIUsePeriods} Initials: {BHO}
120101 WhyDidIUsePeriods BHO.huh.doc  Date: {120101} Initials: {BHO}
So far, I have the following Regex:
@"^(?<Date>.+?(?= ))?.*?((?<Initials>(?<= )[^0-9]*?)|(?<Date>(?<= ).*?))?\..*?$"
This works for filenames that are two words long, but not for anything longer (the trailing groups capture multiple words). The issue is the .*? after the first Date capture group. I need this to greedily capture all "interior" words without consuming the last word. I'm thinking negative lookahead, but I'm not sure how to structure it so that the pattern consumes all characters yet doesn't consume characters matching a certain negative lookahead pattern ( .*?\.).
(It's ok that the Date capture groups will capture non-dates, there is custom parsing logic for that later on)
Is what I want even possible with a negative lookahead? Is there a better strategy to meet these requirements?
EDIT:
I've illustrated what expected results will be next to each file example. I don't want any more specific Regex for the date because it could be in various non-numerical formats as well.
A Regex is unfortunately necessary, as in some cases the problematic .*? will be replaced with more specific patterns (for example, say some files additionally need to contain the word "Foo"; a Regex seems like the best tool).
Upvotes: 2
Views: 987
Reputation: 15000
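This expression will capture four named groups: file (the part that is neither the date nor the initials), Initials, Date, and FileName (everything up to but not including the first dot).
For this I'm using the IgnorePatternWhitespace option, so the pattern can be split across lines and commented: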
^
(?=(?:[^.]*?(?<file>(?<=^)[a-zA-Z\s]*?(?=\s[A-Z]{3}\.|\s)|(?<=\s)[a-zA-Z\s]*?(?=\.|\s[A-Z]{3}\.)))?) # get the file part (i.e. not the date and not the initials)
(?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?) # get the initials
(?=(?:[^.]*?(?<Date>\d+))?) # capture the date value if it exists.
(?=(?<FileName>.*?)\.) # capture the entire filename up to but not including the first dot
.*
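The key idea in the expression above is that every `(?= ... )` lookahead starts again from the beginning of the name and consumes nothing, so each named group is filled independently of where the others matched. Here is a minimal sketch of just that idea, using only the Initials, Date and FileName lines from the pattern above. The class name LookaheadSketch and the helper Describe are made up for illustration and are not part of the original expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSketch
{
    // Three lookaheads, all evaluated at position 0. None of them consumes input,
    // so the order of date and initials in the name does not matter.
    private static readonly Regex Pattern = new Regex(@"
        ^
        (?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?)   # optional: three capitals right before the first dot
        (?=(?:[^.]*?(?<Date>\d+))?)                # optional: first run of digits before the first dot
        (?=(?<FileName>.*?)\.)                     # required: everything up to the first dot
        .*",
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: formats the three groups for one file name.
    public static string Describe(string name)
    {
        Match m = Pattern.Match(name);
        if (!m.Success) return "no match";
        return string.Format("Date='{0}' Initials='{1}' FileName='{2}'",
            m.Groups["Date"].Value, m.Groups["Initials"].Value, m.Groups["FileName"].Value);
    }

    public static void Main()
    {
        // The date is found whether it is the first or the last word.
        Console.WriteLine(Describe("Myfile 120101.xls"));
        Console.WriteLine(Describe("120101 MyFile.xls"));
        Console.WriteLine(Describe("120101 My file name BHO.docx"));
    }
}
```

Because the optional groups are wrapped in `(?: ... )?`, a missing date or missing initials leaves the group empty instead of failing the whole match.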
Sample Text
Myfile 120101.xls
120101 MyFile.xls
MyFile BHO.doc
120101 My file name BHO.docx
Foo.bar
WhyDidIUsePeriods.huh.doc
120101 WhyDidIUsePeriods.huh.doc
WhyDidIUsePeriods BHO.huh.doc
120101 WhyDidIUsePeriods BHO.huh.doc
Code
// Requires: using System.Text.RegularExpressions;
Regex re = new Regex(@"^(?=(?:[^.]*?(?<file>(?<=^)[a-zA-Z\s]*?(?=\s[A-Z]{3}\.|\s)|(?<=\s)[a-zA-Z\s]*?(?=\.|\s[A-Z]{3}\.)))?)(?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?)(?=(?:[^.]*?(?<Date>\d+))?)(?=(?<FileName>.*?)\.).*", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
// sourcestring holds the sample text above, one filename per line.
MatchCollection mc = re.Matches(sourcestring);
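The two lines above leave out the surrounding program. A self-contained sketch that runs the same pattern over the sample text and prints every group of every match might look like the following. The class name GroupDump, the Dump helper and the SampleText field are assumptions added for illustration. The lines are joined with "\n" on purpose, so that no "\r" ends up inside a match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDump
{
    // The single-line pattern from the snippet above, unchanged.
    public static readonly Regex Re = new Regex(
        @"^(?=(?:[^.]*?(?<file>(?<=^)[a-zA-Z\s]*?(?=\s[A-Z]{3}\.|\s)|(?<=\s)[a-zA-Z\s]*?(?=\.|\s[A-Z]{3}\.)))?)(?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?)(?=(?:[^.]*?(?<Date>\d+))?)(?=(?<FileName>.*?)\.).*",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

    // The sample text, one file name per line.
    public static readonly string SampleText = string.Join("\n", new[]
    {
        "Myfile 120101.xls",
        "120101 MyFile.xls",
        "MyFile BHO.doc",
        "120101 My file name BHO.docx",
        "Foo.bar",
        "WhyDidIUsePeriods.huh.doc",
        "120101 WhyDidIUsePeriods.huh.doc",
        "WhyDidIUsePeriods BHO.huh.doc",
        "120101 WhyDidIUsePeriods BHO.huh.doc"
    });

    // Returns one "[match][group] = value" line per group of every match.
    public static List<string> Dump(Regex re, string text)
    {
        var lines = new List<string>();
        MatchCollection mc = re.Matches(text);
        for (int i = 0; i < mc.Count; i++)
        {
            // GetGroupNames() includes "0" (the whole match) as well as the named groups.
            foreach (string name in re.GetGroupNames())
            {
                lines.Add(string.Format("[{0}][{1}] = {2}", i, name, mc[i].Groups[name].Value));
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Dump(Re, SampleText))
        {
            Console.WriteLine(line);
        }
    }
}
```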
Matches
[0][0] = Myfile 120101.xls
[0][file] = Myfile
[0][Initials] =
[0][Date] = 120101
[0][FileName] = Myfile 120101
[1][0] = 120101 MyFile.xls
[1][file] = MyFile
[1][Initials] =
[1][Date] = 120101
[1][FileName] = 120101 MyFile
[2][0] = MyFile BHO.doc
[2][file] = MyFile
[2][Initials] = BHO
[2][Date] =
[2][FileName] = MyFile BHO
[3][0] = 120101 My file name BHO.docx
[3][file] = My file name
[3][Initials] = BHO
[3][Date] = 120101
[3][FileName] = 120101 My file name BHO
[4][0] = Foo.bar
[4][file] = Foo
[4][Initials] =
[4][Date] =
[4][FileName] = Foo
[5][0] = WhyDidIUsePeriods.huh.doc
[5][file] = WhyDidIUsePeriods
[5][Initials] =
[5][Date] =
[5][FileName] = WhyDidIUsePeriods
[6][0] = 120101 WhyDidIUsePeriods.huh.doc
[6][file] = WhyDidIUsePeriods
[6][Initials] =
[6][Date] = 120101
[6][FileName] = 120101 WhyDidIUsePeriods
[7][0] = WhyDidIUsePeriods BHO.huh.doc
[7][file] = WhyDidIUsePeriods
[7][Initials] = BHO
[7][Date] =
[7][FileName] = WhyDidIUsePeriods BHO
[8][0] = 120101 WhyDidIUsePeriods BHO.huh.doc
[8][file] = WhyDidIUsePeriods
[8][Initials] = BHO
[8][Date] = 120101
[8][FileName] = 120101 WhyDidIUsePeriods BHO
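The expression above identifies the date by digits and the initials by three capital letters. The question also asked how to swallow the interior words without eating the last one when the date may not be numeric. One possible sketch of that, separate from the expression above, is to make the interior part greedy but force it to end in a space, while the last word may contain neither a space nor a dot. This assumes words are separated by single spaces and that the name ends at the first dot. The group names First and Last and the class FirstLastWord are hypothetical; telling a date from initials would still be left to the custom parsing logic mentioned in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstLastWord
{
    // First: the first word. Last: the last word before the first dot.
    // The optional middle part "[^.]* " is greedy but must end in a space,
    // and Last cannot contain a space, so the middle never eats the last word.
    private static readonly Regex Pattern = new Regex(
        @"^(?<First>[^ .]+) (?:[^.]* )?(?<Last>[^ .]+)\.");

    // Hypothetical helper: returns "first | last", or "None" for single-word names.
    public static string Describe(string name)
    {
        Match m = Pattern.Match(name);
        return m.Success
            ? m.Groups["First"].Value + " | " + m.Groups["Last"].Value
            : "None";
    }

    public static void Main()
    {
        Console.WriteLine(Describe("120101 My file name BHO.docx"));
        Console.WriteLine(Describe("Myfile 120101.xls"));
        Console.WriteLine(Describe("Foo.bar"));
    }
}
```

If some files must additionally contain a specific word, the `[^.]* ` middle part is the piece that would be replaced with a more specific pattern.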
Upvotes: 1