Reputation: 17528
How can I fix this RegEx to optionally capture a file extension?
I am trying to match a string with an optional component, but something appears to be wrong. (The strings being matched are from a printer log.)
My RegEx (.NET Flavor) is as follows:
.*(header_\d{10,11}_).*(_.*_\d{8}).*(\.\w{3,4}).*
-------------------------------------------
.* # Ignore some garbage in the front
(header_ # Match the start of the file name,
\d{10,11}_) # including the ID (10 - 11 digits)
.* # Ignore the type code in the middle
(_.*_\d{8}) # Match some random characters, then an 8-digit date
.* # Ignore anything between this and the file extension
(\.\w{3,4}) # Match the file extension, 3 or 4 characters long
.* # Ignore the rest of the string
I expect this to match strings like:
str1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
str2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
str3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"
Where the capture groups return something like:
$1 = header_0000000602_
$2 = _mc2e1nrobr1a3s55niyrrqvy_20081212
$3 = .doc
Where $3 can be empty if no file extension is found. $3 is the optional part, as you can see in str3 above.
If I add "?" to the end of the third capture group, "(\.\w{3,4})?", the RegEx no longer captures $3 for any string. If I add "+" instead, "(\.\w{3,4})+", the RegEx no longer matches str3 at all, which is to be expected.
I feel that using "?" at the end of the third capture group is the appropriate thing to do, but it doesn't work as I expect. I am probably being too naive with the ".*" sections that I use to ignore parts of the string.
Doesn't Work As Expected:
.*(header_\d*_).*(_.*_.{8}).*(\.\w{3,4})?.*
Upvotes: 6
Views: 10273
Reputation: 43815
This should give you the correct result:
.*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*
-------------------------------------------
.*? # Prevent a greedy match
(header_ #
\d{10,11}_) #
.*? # Prevent a greedy match
(_.*_\d{8}) #
[^.]* # Take everything that is NOT a period
(\.\w{3,4})? # Optionally match the extension
.* #
The implicit assumption is that the period will be the beginning of a file extension after the digits match. The following wouldn't meet this requirement:
string unmatched = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"
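To make the assumption above concrete, here is a minimal sketch that runs the pattern against the question's three sample strings and the double-period string. It uses Python's re module rather than .NET, on the assumption that the constructs in this pattern behave the same way in both engines; the helper name extension is made up for illustration.

```python
import re

# Pattern copied from the answer above; Python's re accepts it unchanged.
PATTERN = re.compile(r".*?(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?.*")

STR1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
STR2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
STR3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"
DOUBLE_DOT = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].foobar.txt"


def extension(text):
    """Return capture group 3, or None when the optional group did not participate."""
    match = PATTERN.match(text)
    return match.group(3) if match else None


for label, text in [("str1", STR1), ("str2", STR2), ("str3", STR3), ("double dot", DOUBLE_DOT)]:
    print(label, "->", extension(text))
```

In this sketch the double-period string still matches, but group 3 comes back as ".foob" (the first period plus up to four word characters) instead of ".txt", which is the limitation described above.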
Also, when taking out your groups in .NET make sure your code looks like this:
regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value
regex.Match(string_to_match).Groups[3].Value
and not this:
// Groups[0] is the entire match, not the first capture group
regex.Match(string_to_match).Groups[0].Value
regex.Match(string_to_match).Groups[1].Value
regex.Match(string_to_match).Groups[2].Value
This is something that tripped me up at first.
Upvotes: 2
Reputation: 9335
Here is one that works for what you're posting:
^.*(?<header>header_\d{10,11})_.*(?<date>_[a-z0-9]+_\d{8})(\[\d+\])(?<ext>(\.[a-zA-Z0-9]{3,4})?).*
The replacement is:
Header: $1
Date: $2
Extension: $4
I didn't use the named groups in the replacement because I couldn't figure out how to get TextMate to do it, but the named groups were helpful to force the capture.
Upvotes: 1
Reputation: 57832
This works for the examples you've posted:
^.*?(?<header>\d+)_.*?_(?<date>\d{8}).*?(?:\.(?<ext>\w{3,4}))?[\w\s\[\]]*$
I'm assuming that the text "header" and the random characters between that and the date aren't important, so those aren't captured by this regex. I also used the .NET named capture feature for clarity, but be aware that named captures aren't supported in every flavor of RegEx, and some flavors use a different syntax for them.
If the text after the file name contains any non-alphanumeric characters other than [ and ], the pattern will need to be revised.
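As a rough illustration of how the named groups read in practice, here is the same pattern ported to Python's re module. The port is an assumption on my part, not the .NET code: Python spells named groups (?P<name>...) instead of (?<name>...), and the helper name parse is hypothetical.

```python
import re

# Same structure as the .NET pattern above, with Python's (?P<name>...) syntax.
PATTERN = re.compile(
    r"^.*?(?P<header>\d+)_.*?_(?P<date>\d{8}).*?(?:\.(?P<ext>\w{3,4}))?[\w\s\[\]]*$"
)

STR1 = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"
STR2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
STR3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


def parse(text):
    """Return a dict of the named groups, or None if the string does not match."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


for text in (STR1, STR2, STR3):
    print(parse(text))
```

Note that here the ext group holds the extension without its leading period, and it is None for the string that has no extension.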
Upvotes: 1
Reputation: 16505
Replace the .* after your second capture group with a match for characters that are not a period, then match your extension:
".*(header_\d{10,11}_).*(_.*_\d{8})[^.]*(\.\w{3,4})?"
Upvotes: 2
Reputation: 120644
One possibility is that the second to last .* is being greedy. You might try changing it to:
.*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*
                             ^ Added that
That wasn't correct. This one will match the input you supplied, but it assumes that the first . it encounters is the start of a file extension:
.*(header_\d*_).*(_.*_.{8})[^.]*(\.\w{3,4})?.*
Edit: Removed the unnecessary escaping I had in the second regex.
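A small sketch of why the lazy version does not help while the negated class does. It uses Python's re module as a stand-in for .NET (an assumption; these constructs should behave the same in both), and the helper name third_group is made up for illustration.

```python
import re

# First attempt: make the third .* lazy.
LAZY = re.compile(r".*(header_\d*_).*(_.*_.{8}).*?(\.\w{3,4})?.*")
# Second attempt: only skip characters that are not a period
# (a period needs no escape inside a character class).
NEGATED = re.compile(r".*(header_\d*_).*(_.*_.{8})[^.]*(\.\w{3,4})?.*")

SAMPLE = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"


def third_group(pattern, text):
    """Return capture group 3, or None when the optional group did not participate."""
    match = pattern.match(text)
    return match.group(3) if match else None


print("lazy:   ", third_group(LAZY, SAMPLE))
print("negated:", third_group(NEGATED, SAMPLE))
```

With the lazy quantifier the engine is satisfied by matching nothing, the optional group then fails at "[" and is skipped, and the final .* swallows the rest, so group 3 stays empty. The negated class is forced to advance up to the period, where the optional group can match.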
Upvotes: 6
Reputation: 26770
Well, .* is probably the wrong way to start the regex. It matches 0 or more (*) single characters of anything (.), which means your entire file name can be matched by that alone. If you leave it off, the regex will start matching when it reaches header, which is what you want. You could also replace it with \b, which matches a word boundary. I also suggest using a tool such as The Regex Coach so you can step through the regex and see exactly what's wrong and what your capture groups will be.
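Here is a hedged sketch of the "leave it off" idea: no leading or trailing .*, a word-boundary assertion (\b) in front of header, and an unanchored search that scans for the first match. It is written with Python's re.search, which I am assuming behaves like .NET's Regex.Match for this purpose. The pattern is a hypothetical variant that also borrows the [^.]* idea from the other answers, and parse is an illustrative name.

```python
import re

# Hypothetical variant: the engine scans for "header" itself, so no leading .* is needed.
PATTERN = re.compile(r"\b(header_\d*_).*?(_.*_.{8})[^.]*(\.\w{3,4})?")

STR2 = "Microsoft PowerPoint - header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1].txt"
STR3 = "header_00000000076_d_al41zguyvgqfj2454jki5l55_20071203[1]"


def parse(text):
    """Return (offset where the match starts, tuple of the three groups), or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.start(), match.groups()


print(parse(STR2))
print(parse(STR3))
```

The match starts at the position of header rather than at the beginning of the string, so the leading garbage never has to be consumed by the pattern.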
Upvotes: 2
Reputation: 54421
I believe the problem is in your 3rd .*, which you annotated above with "Ignore anything between this and the file extension". It's greedy, so it will match ANYTHING. When you make the extension pattern optional, the 3rd .* matches up to the end of the string, which is allowed. Assuming that there will NEVER be a '.' character in that extraneous bit, you can replace .* with [^.]* and the rest will hopefully work after you restore the ? that you had to remove.
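One way to see this for yourself is to wrap the 3rd .* in an extra capture group so that what it consumed becomes visible. The sketch below does that in Python's re module (assumed here to behave like .NET for these constructs). The extra group shifts the extension to group 4, and the helper name filler_and_extension is made up for illustration.

```python
import re

# The question's pattern with "?" on the extension; the 3rd .* is wrapped in
# an extra group (group 3) so its match is visible. The extension is group 4.
GREEDY = re.compile(r".*(header_\d*_).*(_.*_.{8})(.*)(\.\w{3,4})?.*")
# Same pattern with the 3rd .* replaced by [^.]*.
NEGATED = re.compile(r".*(header_\d*_).*(_.*_.{8})([^.]*)(\.\w{3,4})?.*")

SAMPLE = "header_0000000602_t_mc2e1nrobr1a3s55niyrrqvy_20081212[1].doc [Compatibility Mode]"


def filler_and_extension(pattern, text):
    """Return (text eaten by the filler group, extension group), or None if no match."""
    match = pattern.match(text)
    return (match.group(3), match.group(4)) if match else None


print("greedy: ", filler_and_extension(GREEDY, SAMPLE))
print("negated:", filler_and_extension(NEGATED, SAMPLE))
```

In the greedy case the filler group holds everything from "[1]" to the end of the string, extension included, and the extension group is empty. With [^.]* the filler stops at "[1]" and the extension group gets ".doc".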
Upvotes: 3