fresh learner

Reputation: 467

Regex to parse file name in Java

I am trying to parse a file name according to a given pattern, but I am not able to get the match right. Here is a sample file name:

CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc

And here are my requirements:

Up to the # character, the file name can contain anything. After #, I have to find the character _ or the character - to separate the strings. The strings between the separators (either _ or -, but not both) can contain any other character. So, after the # character, there must be exactly three (3) _ or - characters in total. The name should end with .doc, .docx, or .odt, but NOT with .ok.doc, .ok.docx, or .ok.odt.

Here is what I tried:

(.*)#([^_-]+)[_-]([^_-]+)[_-]([^_-]+)[_-]([^_-]+)\.[doc|odt|docx].*(?<!\.ok)$

But this forces me to end the string with .doc.ok or .docs.ok or .docx.ok, and I actually want to keep the file extension at the end.

If I try this:

(.*)#([^_-]+)[_-]([^_-]+)[_-]([^_-]+)[_-]([^_-]+)\..*(?<!ok\.[doc|odt|docx])$

it won't work.

Any help would be appreciated. Thank you :)

Upvotes: 1

Views: 1366

Answers (1)

Wiktor Stribiżew

Reputation: 626689

It seems you can use

^([^#]*#[^-_]*)[-_](.*)$(?<=(?<!\.ok)\.(?:docx?|odt)$)

Explanation:

  • ^ - start of string (not necessary when used with .matches(), but not harmful)
  • ([^#]*#[^-_]*) - Group 1: any 0+ characters other than # ([^#]*), followed by # and then any 0+ characters other than - and _ ([^-_]*)
  • [-_] - a single - or _ separator
  • (.*)$ - match 0+ characters other than a newline (since DOTALL mode is not specified) up to the end of string BUT...
  • (?<=(?<!\.ok)\.(?:docx?|odt)$) - after reaching the end, check that the string ends with .doc, .docx or .odt (see (?<=\.(?:docx?|odt)$)) and that this extension is not preceded by .ok (see (?<!\.ok)). In PCRE, these conditions would have to be split; Java's regex engine accepts alternation and bounded quantifiers such as x? inside a lookbehind.
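To see the end-of-name condition from the last bullet on its own, here is a minimal sketch that applies only the extension check with find(). The class name, the method name and the sample file names are made up for illustration.

```java
import java.util.regex.Pattern;

class ExtensionCheck {
    // The end-of-name condition from the answer, used on its own:
    // .doc, .docx or .odt at the end, not directly preceded by ".ok".
    private static final Pattern ALLOWED_EXT =
            Pattern.compile("(?<!\\.ok)\\.(?:docx?|odt)$");

    static boolean hasAllowedExtension(String fileName) {
        // find() is enough here because the pattern is anchored with $.
        return ALLOWED_EXT.matcher(fileName).find();
    }

    public static void main(String[] args) {
        System.out.println(hasAllowedExtension("report.doc"));     // true
        System.out.println(hasAllowedExtension("report.docx"));    // true
        System.out.println(hasAllowedExtension("report.ok.doc"));  // false
        System.out.println(hasAllowedExtension("report.ok.odt"));  // false
        System.out.println(hasAllowedExtension("report.txt"));     // false
    }
}
```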

A lookahead-based alternative:

^([^#]*#[^-_]*)[-_](?=.*(?<!\.ok)\.(?:docx?|odt)$)(.*)$

See the regex101 demo. It is the same, but all the end-of-string checks are done after matching the - or _.
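A short sketch of the lookahead variant in Java, assuming it is used with matches() like the lookbehind version. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadVariant {
    // Same groups as the first pattern; the extension check runs in a
    // lookahead right after the first separator instead of at the end.
    private static final Pattern P = Pattern.compile(
            "^([^#]*#[^-_]*)[-_](?=.*(?<!\\.ok)\\.(?:docx?|odt)$)(.*)$");

    // Returns {group 1, group 2}, or null when the name is rejected.
    static String[] split(String fileName) {
        Matcher m = P.matcher(fileName);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc");
        System.out.println(parts[0]); // CRS-ISAU-RPV#3430
        System.out.println(parts[1]); // Dedalus_Conc.ok.erto_AOTreviglio.doc
        // A name ending in .ok.docx is rejected, so split() returns null.
        System.out.println(split("CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.ok.docx") == null);
    }
}
```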

See the Java demo:

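import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Demo {
    public static void main(String[] args) {
        List<String> strs = Arrays.asList(
            "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc",
            "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.docx",
            "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.odt",
            "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.ok.docx",
            "CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.ok.odt"
        );
        // Compile once; here the first separator is captured as group 2.
        Pattern p = Pattern.compile("^([^#]*#[^-_]*)([-_])(.*)$(?<=(?<![.]ok)[.](?:docx?|odt)$)");
        for (String str : strs) {
            System.out.println("----------\nMatching: " + str);
            Matcher m = p.matcher(str);
            if (m.matches()) {
                System.out.println(m.group(1)); // everything up to the first separator after #
                System.out.println(m.group(2)); // the separator itself
                System.out.println(m.group(3)); // the rest, including the extension
            } else {
                System.out.println("No match");
            }
        }
    }
}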
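The demo above splits only at the first separator. If the "exactly three _ or - characters after #" requirement also has to be enforced, one possible sketch is to keep the four [^_-]+ segments from the question and replace [doc|odt|docx] with a non-capturing group, since the bracketed form is a character class that matches a single character. This is an assumption about how to read the question, not part of the patterns above, and the class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictFileNameParser {
    // Four segments free of '_' and '-', joined by exactly three separators,
    // then an allowed extension that is not preceded by ".ok".
    private static final Pattern STRICT = Pattern.compile(
            "^(.*)#([^_-]+)[_-]([^_-]+)[_-]([^_-]+)[_-]([^_-]+)"
            + "(?<!\\.ok)\\.(?:docx?|odt)$");

    // Returns the five captured parts, or null when the name is rejected.
    static String[] parse(String fileName) {
        Matcher m = STRICT.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[m.groupCount()];
        for (int i = 0; i < parts.length; i++) {
            parts[i] = m.group(i + 1);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parse("CRS-ISAU-RPV#3430_Dedalus_Conc.ok.erto_AOTreviglio.doc");
        System.out.println(String.join(" | ", parts));
        // CRS-ISAU-RPV | 3430 | Dedalus | Conc.ok.erto | AOTreviglio

        // Rejected because the extension is preceded by ".ok".
        System.out.println(parse("CRS-ISAU-RPV#3430_Dedalus_Conc_AOTreviglio.ok.doc") == null); // true
    }
}
```

Because the segments exclude both separators, names with fewer or more than three separators after # fail to match, at the cost of no longer allowing _ or - inside a segment.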

Upvotes: 2
