Steve B

Reputation: 37710

Extract filename parts to solve file name conflict

In a C# program, I want to write a file to a folder where other files may already exist. If the name is already taken, a suffix is added to the file name: myfile.docx, myfile (1).docx, myfile (2).docx, and so on.

I'm struggling to analyse the existing file names and extract their parts.

Specifically, I use this regex: (?<base>.+?)(\((?<idx>\d+)\)?)?(?<ext>(\.[\w\.]+))

This regex outputs:

╔═══════════════════════╦══════════════╦═════╦═══════════╦═══════════════════════════════════╗
║    Source Filename    ║     base     ║ idx ║ extension ║              Success              ║
╠═══════════════════════╬══════════════╬═════╬═══════════╬═══════════════════════════════════╣
║ somefile.docx         ║ somefile     ║     ║ .docx     ║ Yes                               ║
║ somefile              ║              ║     ║           ║ No, base should be "somefile"     ║
║ somefile (6)          ║              ║     ║           ║ No, base should be "somefile (6)" ║
║ somefile (1).docx     ║ somefile     ║   1 ║ .docx     ║ Yes                               ║
║ somefile (2)(1).docx  ║ somefile (2) ║   1 ║ .docx     ║ Yes                               ║
║ somefile (4).htm.tmpl ║ somefile     ║   4 ║ .htm.tmpl ║ Yes                               ║
╚═══════════════════════╩══════════════╩═════╩═══════════╩═══════════════════════════════════╝

As you can see, all cases work except when the file name has no extension.

How can I fix my regex to handle the failing cases?

Reproduction: https://regex101.com/r/q9uQii/1

If it matters, here is the relevant C# code:

private static readonly Regex g_fileNameAnalyser = new Regex(
    @"(?<base>.+?)(\((?<idx>\d+)\)?)?(?<ext>(\.[\w\.]+))", 
    RegexOptions.Compiled | RegexOptions.ExplicitCapture
    );

...

var candidateMatch = g_fileNameAnalyser.Match(somefilename);
var candidateInfo = new
{
    baseName = candidateMatch.Groups["base"].Value.Trim(),
    idx = candidateMatch.Groups["idx"].Success ? int.Parse(candidateMatch.Groups["idx"].Value) : 0,
    ext = candidateMatch.Groups["ext"].Value
};

Upvotes: 3

Views: 329

Answers (2)

The fourth bird

Reputation: 163477

You could repeat the (digits) part while asserting, with a lookahead, that another (digits) pair follows. That last pair is then captured as the idx group.

Make the idx group and the ext group optional using a question mark.

^(?<base>[^\r\n.()]+(?:(?:\(\d+\))*(?=\(\d+\)))?)(?:\((?<idx>\d+)\))?(?<ext>(?:\.[\w\.]+))?$
  • ^ Start of string
  • (?<base> Start base group
    • [^\r\n.()]+ Match 1+ times any char except the listed
    • (?: Non capturing group
      • (?:\(\d+\))*(?=\(\d+\)) Repeat matching (digits) until there is 1 (digits) part left at the right
    • )? Close group and make it optional
  • ) End base group
  • (?:\((?<idx>\d+)\))? Optional part to match idx group between ( and )
  • (?<ext>(?:\.[\w\.]+))? Optional ext group
  • $ End of string

Regex demo
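
As a rough illustration of how this pattern could be plugged into the question's code, here is a minimal sketch. The class name FileNameParts and the Parse method are hypothetical names, not part of the question; the pattern itself is copied unchanged from above. Throwing on a failed match is an assumed choice, since this pattern rejects names whose base contains a dot or a parenthesis.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FileNameParts
{
    // Pattern copied from this answer; the class and method names are hypothetical.
    private static readonly Regex Analyser = new Regex(
        @"^(?<base>[^\r\n.()]+(?:(?:\(\d+\))*(?=\(\d+\)))?)(?:\((?<idx>\d+)\))?(?<ext>(?:\.[\w\.]+))?$",
        RegexOptions.Compiled);

    // Returns the trimmed base name, the index (0 when absent)
    // and the extension (empty string when absent).
    public static (string BaseName, int Idx, string Ext) Parse(string fileName)
    {
        Match m = Analyser.Match(fileName);
        if (!m.Success)
            throw new FormatException($"Unrecognised file name: {fileName}");

        return (
            m.Groups["base"].Value.Trim(),
            m.Groups["idx"].Success ? int.Parse(m.Groups["idx"].Value) : 0,
            m.Groups["ext"].Value);
    }
}
```

With this sketch, FileNameParts.Parse("somefile (2)(1).docx") yields ("somefile (2)", 1, ".docx") and FileNameParts.Parse("somefile") yields ("somefile", 0, ""). Note that "somefile (6)" is parsed as base "somefile" with idx 6, which differs from the "somefile (6)" base listed in the question's table.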

Upvotes: 1

Wiktor Stribiżew

Reputation: 627220

You may use

^(?<base>.+?)\s*(?:\((?<idx>\d+)\))?(?<ext>\.[\w.]+)?$

See the regex demo.

Pattern details

  • ^ - start of string
  • (?<base>.+?) - Group "base": any 1 or more chars other than newline, as few as possible
  • \s* - 0+ whitespaces
  • (?:\((?<idx>\d+)\))? - an optional sequence of:
    • \( - a ( char
    • (?<idx>\d+) - Group "idx": 1+ digits
    • \) - a ) char
  • (?<ext>\.[\w.]+)? - an optional Group "ext":
    • \. - a . char
    • [\w.]+ - 1+ letters, digits, _ or . chars
  • $ - end of string.
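
To tie the pattern back to the original goal of avoiding name conflicts, here is a minimal sketch of one way the parsed parts might be used. The class FileNameConflict and its Parse and NextFreeName methods are hypothetical helpers. The numbering policy (one more than the highest index already in use) and the case-insensitive comparison are assumptions, not something stated in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FileNameConflict
{
    // Pattern copied from this answer; everything else is a hypothetical helper.
    private static readonly Regex Analyser = new Regex(
        @"^(?<base>.+?)\s*(?:\((?<idx>\d+)\))?(?<ext>\.[\w.]+)?$",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    // Splits a file name into base, index (0 when absent) and extension.
    public static (string BaseName, int Idx, string Ext) Parse(string fileName)
    {
        Match m = Analyser.Match(fileName);
        return (
            m.Groups["base"].Value.Trim(),
            m.Groups["idx"].Success ? int.Parse(m.Groups["idx"].Value) : 0,
            m.Groups["ext"].Value);
    }

    // Assumed policy: keep the wanted name if no existing file has exactly that name;
    // otherwise use one more than the highest index found for the same base and extension.
    public static string NextFreeName(string wanted, IEnumerable<string> existing)
    {
        var target = Parse(wanted);
        bool taken = false;
        int highest = 0;

        foreach (string name in existing)
        {
            // Case-insensitive comparison is an assumption (typical for Windows file systems).
            if (string.Equals(name, wanted, StringComparison.OrdinalIgnoreCase))
                taken = true;

            var parts = Parse(name);
            bool sameFile =
                string.Equals(parts.BaseName, target.BaseName, StringComparison.OrdinalIgnoreCase) &&
                string.Equals(parts.Ext, target.Ext, StringComparison.OrdinalIgnoreCase);
            if (sameFile)
                highest = Math.Max(highest, parts.Idx);
        }

        return taken ? $"{target.BaseName} ({highest + 1}){target.Ext}" : wanted;
    }
}
```

For example, with existing names "myfile.docx", "myfile (1).docx" and "other.docx", NextFreeName("myfile.docx", existing) gives "myfile (2).docx", while NextFreeName("new.docx", existing) returns "new.docx" unchanged. In a real program the existing names could come from a directory listing.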

Upvotes: 1
