Regex required to match a file path where the file path is derived from the file name.

Question

I have a drive which contains over 8 million files and is the file storage area for a CRM system. The files are stored in a specific format and each one should have a matching record in the database. However, due to some very poor security, the world and his wife have also been creating files on the same drive. My task is to identify the invalid files, which I am doing using PowerShell and a regular expression. A typical valid file path would look something like this:

P:\PERSON\06\19\09\619090.5577930.DOC

All the files are on a drive called P: which contains four sub directories called EVENT, OPPORTUN, ORGANISA and PERSON. Each of these contains a variable number of nested sub directories whose names range from 00 to 99. The file name is two sets of digits separated by a period, followed by the extension.

The regex I am using to match this pattern is:

^P:\\(EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2}\\)+\d+\.\d+\.\w{3,4}$

The complication is that a valid file also has a relationship between the first set of digits in the file's name and the sub directory path, which is this:

Remove the last digit.

If the length of the remaining number is now odd, add a leading zero.

Divide the resulting number into pairs and that should be the path.

So using the example above:

First set of digits is: 619090

Remove the last digit: 61909

The length is odd so add a leading zero: 061909

Divide into pairs: 06\19\09
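
The three steps above can also be written out directly in PowerShell, which may be handy for checking a regex against known cases. This is only a minimal sketch: it assumes the first set of digits has already been pulled out of the file name and is at least one character long, and the function name Get-ExpectedSubPath is made up for illustration.

```powershell
# Hypothetical helper (name invented for this sketch): derive the expected
# sub directory path from the first set of digits in a file name.
function Get-ExpectedSubPath {
    param([string]$Digits)

    # Step 1: remove the last digit.
    $d = $Digits.Substring(0, $Digits.Length - 1)

    # Step 2: if the length is now odd, add a leading zero.
    if ($d.Length % 2 -eq 1) { $d = '0' + $d }

    # Step 3: divide into pairs and join them with backslashes.
    $pairs = for ($i = 0; $i -lt $d.Length; $i += 2) { $d.Substring($i, 2) }
    $pairs -join '\'
}

Get-ExpectedSubPath '619090'   # 06\19\09
```

Because the function loops over the digits, it is not tied to a particular directory depth.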

My question is: can this logic be incorporated into my regex? Is there a way to do it using forward or back references?

Dave Sexton · Accepted Answer

Jerry's answer pointed me in the right direction, as did discovering that you can have capturing groups contained within non-capturing groups. Below is my regex together with a few tests:

$samples = @()
$samples += 'P:\PERSON\06\19\09\619090.5577930.DOC' #good
$samples += 'P:\PERSON\19\09\19090.5577930.DOC' #good
$samples += 'P:\PERSON\10\10\10\06\19\09\1010100619090.5577930.DOC' #good
$samples += 'P:\PERSON\06\19\09\619090a.5577930.DOC' #bad
$samples += 'P:\PERSON\06\19\09\61909090.5577930.DOC' #bad
$samples += 'P:\PERSON\06\19\09\6190905577930.DOC' #bad

$regex = '^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\'
$regex += '(?:(\d)(\d)\\|0(\d)\\)(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?'
$regex += '(?:\1\2|\3)\4?\5?\6?\7?\8?\d?\.\d+\.\w{3,4}$'

$samples | % {
    $_ -match $regex

}
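
To list the paths that fail the check rather than print True or False for each one, the same pattern can be used with -notmatch in a filter. This is a minimal sketch, assuming the paths are already available as strings; the two sample paths and the variable names $paths and $invalid are for illustration only.

```powershell
# Same pattern as above, built up in three parts.
$regex  = '^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\'
$regex += '(?:(\d)(\d)\\|0(\d)\\)(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?'
$regex += '(?:\1\2|\3)\4?\5?\6?\7?\8?\d?\.\d+\.\w{3,4}$'

# Illustrative input: one valid path and one invalid path.
$paths = @(
    'P:\PERSON\06\19\09\619090.5577930.DOC',
    'P:\PERSON\06\19\09\61909090.5577930.DOC'
)

# Keep only the paths that do not match the pattern.
$invalid = $paths | Where-Object { $_ -notmatch $regex }
$invalid
```

On the real drive the strings would presumably come from something like the FullName property of Get-ChildItem -Recurse output instead of a hard-coded array. Note that the pattern as written has one mandatory and five optional directory groups, so it covers at most six directory levels.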
