Dave Sexton

Reputation: 11188

Regex required to match a file path where the file path is derived from the file name.

I have a drive which contains over 8 million files and is the file storage area for a CRM system. The files are stored in a specific format and each one should have a matching record in the database. However, due to some very poor security, the world and his wife have also been creating files on the same drive. My task is to identify the invalid files, which I am doing using PowerShell and a regular expression. A typical valid file path would look something like this:

P:\PERSON\06\19\09\619090.5577930.DOC

All the files are on a drive called P: which contains four subdirectories called EVENT, OPPORTUN, ORGANISA and PERSON. Each of these contains a variable number of nested subdirectories whose names range from 00 to 99. The file name is two sets of digits separated by a period, followed by the extension.

The regex I am using to match this pattern is:

^P:\\(EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2}\\)+\d+\.\d+\.\w{3,4}$
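To make the limitation concrete, here is a minimal sketch of that pattern in use. The second and third sample paths are made up for illustration; the point is that the pattern only checks the layout, so a file whose directories do not correspond to its name still passes.

```powershell
# Structural pattern from the question (single quotes keep the backslashes literal).
$pattern = '^P:\\(EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2}\\)+\d+\.\d+\.\w{3,4}$'

# Sample paths; only the first one comes from the question, the rest are hypothetical.
$paths = @(
    'P:\PERSON\06\19\09\619090.5577930.DOC'   # valid layout, directories match the name
    'P:\PERSON\06\19\09\123456.5577930.DOC'   # valid layout, directories do NOT match the name
    'P:\PERSON\readme.txt'                    # stray file with the wrong layout
)

# Record whether each path passes the structural check.
$results = foreach ($path in $paths) {
    [pscustomobject]@{ Path = $path; Structural = ($path -match $pattern) }
}
$results | Format-Table -AutoSize
```

The first two paths pass and the third fails, which is why the extra relationship between the name and the directories needs its own check.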

The complication is that a valid file also has a relationship between the first set of digits in the file's name and the subdirectory path, which is this:

Remove the last digit.

If the number of remaining digits is now odd, add a leading zero.

Divide the resulting number into pairs and that should be the path.

So using the example above:

First set of digits are: 619090

Remove the last digit: 61909

The length is odd so add a leading zero: 061909

Divide into pairs: 06\19\09
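For reference, the three steps can be sketched as a small PowerShell function. The name Get-ExpectedSubPath is hypothetical, and returning $null for a name with only one leading digit is an assumption, since the rule above does not say what should happen in that case.

```powershell
# Hypothetical helper: derive the expected directory chain from a file name.
function Get-ExpectedSubPath {
    param([Parameter(Mandatory)][string]$FileName)

    # Take the first run of digits, up to the first period.
    if ($FileName -notmatch '^(\d+)\.') { return $null }
    $digits = $Matches[1]

    # Step 1: remove the last digit.
    $digits = $digits.Substring(0, $digits.Length - 1)

    # Assumption: a single-digit name leaves nothing to build a path from.
    if ($digits.Length -eq 0) { return $null }

    # Step 2: if the length is now odd, add a leading zero.
    if ($digits.Length % 2 -eq 1) { $digits = '0' + $digits }

    # Step 3: split into pairs and join them with backslashes.
    $pairs = [regex]::Matches($digits, '\d{2}') | ForEach-Object { $_.Value }
    return ($pairs -join '\')
}

Get-ExpectedSubPath '619090.5577930.DOC'   # 06\19\09
```

Comparing this result with the directory part of the actual path is one way to validate a file without putting the whole rule into a single regex.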

My question is: can this logic be incorporated into my regex? Is there a way to do it using forward references or backreferences?

Upvotes: 0

Views: 3573

Answers (2)

Dave Sexton

Reputation: 11188

Jerry's answer pointed me in the right direction, as did discovering that you can have capturing groups contained within non-capturing groups. Below is my regex together with a few tests:

$samples = @()
$samples += 'P:\PERSON\06\19\09\619090.5577930.DOC' #good
$samples += 'P:\PERSON\19\09\19090.5577930.DOC' #good
$samples += 'P:\PERSON\10\10\10\06\19\09\1010100619090.5577930.DOC' #good
$samples += 'P:\PERSON\06\19\09\619090a.5577930.DOC' #bad
$samples += 'P:\PERSON\06\19\09\61909090.5577930.DOC' #bad
$samples += 'P:\PERSON\06\19\09\6190905577930.DOC' #bad

$regex = '^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\'
$regex += '(?:(\d)(\d)\\|0(\d)\\)(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?'
$regex += '(?:\1\2|\3)\4?\5?\6?\7?\8?\d?\.\d+\.\w{3,4}$'

# Prints True or False for each sample, in order
$samples | ForEach-Object { $_ -match $regex }

Upvotes: 0

Jerry
Jerry

Reputation: 71568

I tried to come up with something, and if PowerShell supports backreferences, you could try this:

^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(?:0(\d)|(\d{2}))\\(\d{2})\\(?P<t>\d{2})\\(?:(?:\1|\2)\3\4)0?\.\d+\.\w{3,4}$

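\1 to \4 are backreferences to the capture groups matched earlier in the pattern. Note that (?P<t>...) is the PCRE/Python spelling of a named group; the .NET engine that PowerShell uses expects (?<t>...), and since the name is never referenced, a plain (\d{2}) works as well.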

I tested some strings on regex101.

The only thing is that it will accept P:\OPPORTUN\61\90\90\619090.5577930.DOC as well. I'm not too sure how to get around this with a single regex without making it even longer than it already is (a bit more than twice as long, perhaps).
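A quick way to see this in PowerShell is sketched below. It assumes the named group is rewritten as a plain capture group, because the .NET regex engine does not accept the (?P<name>...) spelling; the pattern is otherwise unchanged.

```powershell
# Pattern from above, with (?P<t>\d{2}) written as a plain group for .NET.
$regex = '^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(?:0(\d)|(\d{2}))\\(\d{2})\\(\d{2})\\(?:(?:\1|\2)\3\4)0?\.\d+\.\w{3,4}$'

# The example path from the question.
$expected  = 'P:\PERSON\06\19\09\619090.5577930.DOC'
# The path that should be rejected: under the question's rule 619090 maps to 06\19\09, not 61\90\90.
$ambiguous = 'P:\OPPORTUN\61\90\90\619090.5577930.DOC'

$expectedMatches  = $expected -match $regex
$ambiguousMatches = $ambiguous -match $regex

'{0} -> {1}' -f $expected, $expectedMatches
'{0} -> {1}' -f $ambiguous, $ambiguousMatches
```

Both lines report True, which confirms that the second path slips through this version of the pattern.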

It's about twice as long if you want to really make sure:

^(?:P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\0(\d)\\(\d{2})\\(\d{2})\\(?:\1\2\3)0\.\d+\.\w{3,4}|P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2})\\(\d{2})\\(\d{2})\\(?:\4\5\6)\.\d+\.\w{3,4})$

EDIT: A version that allows up to six pairs of digits in the path (capture groups \1 through \7):

^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(?:0(\d)|(\d\d))\\(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(?:\1|\2)\3?\4?\5?\6?\7?)0?\.\d+\.\w{3,4}$

Upvotes: 2
