MgSam

Reputation: 12803

Regex capture first/last word in a filename

I have filenames which can have an arbitrary number of words/spaces in them. Basically, I need the right syntax to consume any characters in the middle of the string without consuming the last word.

Some background: the first word or the last word could be a date that I need to capture. Alternatively, the last word may be initials. I need the date/initials in named capture groups.

Example files:

FileName                                      Expected Capture Groups
--------                                      ----------------------
Myfile 120101.xls                             Date: {Myfile, 120101}
120101 MyFile.xls                             Date: {MyFile, 120101}
MyFile BHO.doc                                Date: {MyFile} Initials: {BHO}
120101 My file name BHO.docx                  Date: {120101} Initials: {BHO}
Foo.bar                                       None    
WhyDidIUsePeriods.huh.doc                     None
120101 WhyDidIUsePeriods.huh.doc              Date: {WhyDidIUsePeriods, 120101}
WhyDidIUsePeriods BHO.huh.doc                 Date: {WhyDidIUsePeriods} Initials: {BHO}
120101 WhyDidIUsePeriods BHO.huh.doc          Date: {120101} Initials: {BHO}

So far, I have the following Regex:

@"^(?<Date>.+?(?= ))?.*?((?<Initials>(?<= )[^0-9]*?)|(?<Date>(?<= ).*?))?\..*?$"

This works for two-word filenames, but not for anything longer (the trailing groups capture multiple words). The issue is the .*? after the first Date capture group: I need it to greedily consume all "interior" words without consuming the last word. I'm thinking of a negative lookahead, but I'm not sure how to structure it so that the pattern consumes all characters yet doesn't consume characters matching a certain negative lookahead pattern ( .*?\.).

(It's ok that the Date capture groups will capture non-dates, there is custom parsing logic for that later on)

Is what I want even possible with a negative lookahead? Is there a better strategy to meet these requirements?

EDIT:

I've illustrated what expected results will be next to each file example. I don't want any more specific Regex for the date because it could be in various non-numerical formats as well.

A Regex is unfortunately necessary, as in some cases, the problematic .*? will be replaced with more specific patterns (for example, say some files additionally need to contain the word "Foo", a Regex seems like the best tool).

Upvotes: 2

Views: 987

Answers (1)

Ro Yo Mi

Reputation: 15000

Description

This expression:

  • assumes the only interesting data in the file name appears before the first dot
  • assumes the initials are three uppercase letters, preceded by a space and followed by a dot
  • captures the portion of the file name that is neither initials nor date
  • captures the entire file name up to, but not including, the first dot
  • captures the initials if they exist
  • captures the date if it exists
  • allows the date, initials and file to appear in any order if they exist in the file name

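For this I'm using the following expression; the line breaks and # comments in it require RegexOptions.IgnorePatternWhitespace: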

^
(?=(?:[^.]*?(?<file>(?<=^)[a-zA-Z\s]*?(?=\s[A-Z]{3}\.|\s)|(?<=\s)[a-zA-Z\s]*?(?=\.|\s[A-Z]{3}\.)))?)   # get the file (aka not date and not initials)
(?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?)      # get the initials
(?=(?:[^.]*?(?<Date>\d+))?)   # capture the date value if it exists.
(?=(?<FileName>.*?)\.)     # capture entire filename upto but not including the first dot
.*

Example

Sample Text

Myfile 120101.xls
120101 MyFile.xls
MyFile BHO.doc
120101 My file name BHO.docx
Foo.bar
WhyDidIUsePeriods.huh.doc
120101 WhyDidIUsePeriods.huh.doc
WhyDidIUsePeriods BHO.huh.doc
120101 WhyDidIUsePeriods BHO.huh.doc

Code

using System.Text.RegularExpressions;

// sourcestring holds the sample text above, one file name per line
Regex re = new Regex(@"^(?=(?:[^.]*?(?<file>(?<=^)[a-zA-Z\s]*?(?=\s[A-Z]{3}\.|\s)|(?<=\s)[a-zA-Z\s]*?(?=\.|\s[A-Z]{3}\.)))?)(?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?)(?=(?:[^.]*?(?<Date>\d+))?)(?=(?<FileName>.*?)\.).*",RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
MatchCollection mc = re.Matches(sourcestring);
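
To read the named groups back out of each match, something like the following sketch may help. It uses the same pattern and options as above on three of the sample file names. The variable `rows` and the output format are chosen for illustration only, and the snippet assumes a compiler that accepts top-level statements (C# 9 or later); on older compilers the body would need to be wrapped in a `Main` method.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Three of the sample file names, joined with "\n" so that
// RegexOptions.Multiline lets ^ match at the start of each line.
string sourcestring = string.Join("\n", new[]
{
    "Myfile 120101.xls",
    "120101 My file name BHO.docx",
    "Foo.bar",
});

// Same pattern and options as in the code above.
Regex re = new Regex(@"^(?=(?:[^.]*?(?<file>(?<=^)[a-zA-Z\s]*?(?=\s[A-Z]{3}\.|\s)|(?<=\s)[a-zA-Z\s]*?(?=\.|\s[A-Z]{3}\.)))?)(?=(?:[^.]*?\s(?<Initials>[A-Z]{3})\.)?)(?=(?:[^.]*?(?<Date>\d+))?)(?=(?<FileName>.*?)\.).*", RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

// One formatted row per match; "rows" is a name made up for this sketch.
var rows = new List<string>();
foreach (Match m in re.Matches(sourcestring))
{
    // A named group that took no part in the match has Success == false
    // and an empty Value, so missing initials or dates come back as "".
    string file = m.Groups["file"].Value;
    string initials = m.Groups["Initials"].Value;
    string date = m.Groups["Date"].Value;
    string fileName = m.Groups["FileName"].Value;
    rows.Add(string.Format("file=[{0}] Initials=[{1}] Date=[{2}] FileName=[{3}]",
                           file, initials, date, fileName));
}

foreach (string row in rows)
{
    Console.WriteLine(row);
}
```

If an empty string is not a convenient marker for "not present", checking `m.Groups["Initials"].Success` is an alternative way to tell whether the optional part was found.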

Matches

[0][0] = Myfile 120101.xls    
[0][file] = Myfile
[0][Initials] = 
[0][Date] = 120101
[0][FileName] = Myfile 120101

[1][0] = 120101 MyFile.xls    
[1][file] = MyFile
[1][Initials] = 
[1][Date] = 120101
[1][FileName] = 120101 MyFile

[2][0] = MyFile BHO.doc    
[2][file] = MyFile
[2][Initials] = BHO
[2][Date] = 
[2][FileName] = MyFile BHO

[3][0] = 120101 My file name BHO.docx
[3][file] = My file name
[3][Initials] = BHO
[3][Date] = 120101
[3][FileName] = 120101 My file name BHO

[4][0] = Foo.bar
[4][file] = Foo
[4][Initials] = 
[4][Date] = 
[4][FileName] = Foo

[5][0] = WhyDidIUsePeriods.huh.doc    
[5][file] = WhyDidIUsePeriods
[5][Initials] = 
[5][Date] = 
[5][FileName] = WhyDidIUsePeriods

[6][0] = 120101 WhyDidIUsePeriods.huh.doc    
[6][file] = WhyDidIUsePeriods
[6][Initials] = 
[6][Date] = 120101
[6][FileName] = 120101 WhyDidIUsePeriods

[7][0] = WhyDidIUsePeriods BHO.huh.doc    
[7][file] = WhyDidIUsePeriods
[7][Initials] = BHO
[7][Date] = 
[7][FileName] = WhyDidIUsePeriods BHO

[8][0] = 120101 WhyDidIUsePeriods BHO.huh.doc
[8][file] = WhyDidIUsePeriods
[8][Initials] = BHO
[8][Date] = 120101
[8][FileName] = 120101 WhyDidIUsePeriods BHO

Upvotes: 1
