Reputation: 11188
I have a drive that contains over 8 million files and serves as the file storage area for a CRM system. The files are stored in a specific format and each one should have a matching record in the database. However, due to some very poor security, the world and his wife have also been creating files on the same drive. My task is to identify the invalid files, which I am doing using PowerShell and a regular expression. A typical valid file path looks something like this:
P:\PERSON\06\19\09\619090.5577930.DOC
All the files are on a drive called P: which contains four sub directories called EVENT, OPPORTUN, ORGANISA and PERSON. Each of these contains a variable number of levels of sub directories whose names range from 00 to 99, and the file name is two sets of digits separated by a period, followed by the extension.
The regex I am using to match this pattern is:
^P:\\(EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2}\\)+\d+\.\d+\.\w{3,4}$
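As a quick sanity check, the pattern can be tried with PowerShell's -match operator. This is only a sketch: apart from the first path, the samples are made up for illustration. The second one passes even though its directories do not correspond to the file name, which is the gap the rest of the question is about.

```powershell
# Structural pattern from the question: it checks the shape of the path only.
$pattern = '^P:\\(EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2}\\)+\d+\.\d+\.\w{3,4}$'

# Hypothetical sample paths (only the first comes from the question).
$paths = @(
    'P:\PERSON\06\19\09\619090.5577930.DOC'   # valid shape, valid relationship
    'P:\PERSON\99\99\619090.5577930.DOC'      # valid shape, wrong directories
    'P:\PERSON\06\19\09\notes.txt'            # not a CRM-style file name
)

# Collect one result object per path so the outcome is easy to inspect.
$results = foreach ($path in $paths) {
    [pscustomobject]@{ Path = $path; ShapeOk = ($path -match $pattern) }
}
$results | Format-Table -AutoSize
```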
The complication is that a valid file also has a relationship between the first set of digits in the file's name and the sub directory path, which is this:
Remove the last digit.
If the number of remaining digits is odd, add a leading zero.
Split the result into pairs of digits; those pairs should be the sub directory path.
So using the example above:
The first set of digits is: 619090
Remove the last digit: 61909
The length is odd so add a leading zero: 061909
Divide into pairs: 06\19\09
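Outside of a regex, the three steps above can be sketched as a small PowerShell function. The name Get-ExpectedSubPath and its parameter are hypothetical, chosen only for this illustration, and the function assumes it is handed just the first set of digits from the file name.

```powershell
function Get-ExpectedSubPath {
    # Hypothetical helper: derives the directory part from the first digit group.
    param([Parameter(Mandatory)][string]$Digits)

    if ($Digits -notmatch '^\d+$') { throw "Expected digits only, got '$Digits'" }

    # Step 1: remove the last digit.
    $trimmed = $Digits.Substring(0, $Digits.Length - 1)

    # Step 2: add a leading zero when the remaining length is odd.
    if ($trimmed.Length % 2 -eq 1) { $trimmed = "0$trimmed" }

    # Step 3: split into pairs and join them with backslashes.
    $pairs = for ($i = 0; $i -lt $trimmed.Length; $i += 2) { $trimmed.Substring($i, 2) }
    return ($pairs -join '\')
}

Get-ExpectedSubPath '619090'   # 06\19\09
```

Comparing this value with the directories actually present in a path is an alternative to encoding the rule in the regex itself.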
My question is: can this logic be incorporated into my regex? Is there a way to do it using forward or back references?
Upvotes: 0
Views: 3573
Reputation: 11188
Jerry's answer pointed me in the right direction, as did discovering that capturing groups can be nested inside non-capturing groups. Below is my regex together with a few tests:
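# Sample paths: the first three should match, the last three should not.
$samples = @()
$samples += 'P:\PERSON\06\19\09\619090.5577930.DOC' # good
$samples += 'P:\PERSON\19\09\19090.5577930.DOC' # good
$samples += 'P:\PERSON\10\10\10\06\19\09\1010100619090.5577930.DOC' # good
$samples += 'P:\PERSON\06\19\09\619090a.5577930.DOC' # bad
$samples += 'P:\PERSON\06\19\09\61909090.5577930.DOC' # bad
$samples += 'P:\PERSON\06\19\09\6190905577930.DOC' # bad
# Drive root and top-level directory.
$regex = '^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\'
# First pair (two digits captured separately, or a padding 0 plus one captured digit), then up to five more pairs.
$regex += '(?:(\d)(\d)\\|0(\d)\\)(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?'
# The file name repeats the captured digits, followed by an optional extra digit.
$regex += '(?:\1\2|\3)\4?\5?\6?\7?\8?\d?\.\d+\.\w{3,4}$'
# Prints True or False for each sample.
$samples | ForEach-Object {
    $_ -match $regex
}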
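One thing that may be worth checking before running this over the whole drive: because each later backreference and the final \d are optional, the pattern appears to accept some names that skip directory pairs or lack the trailing digit. The sketch below rebuilds the same pattern and tries two made-up paths that do not follow the remove, pad and split rule; both come back as accepted.

```powershell
# Same pattern as above, rebuilt so the snippet runs on its own.
$regex  = '^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\'
$regex += '(?:(\d)(\d)\\|0(\d)\\)(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?(?:(\d{2})\\)?'
$regex += '(?:\1\2|\3)\4?\5?\6?\7?\8?\d?\.\d+\.\w{3,4}$'

# Hypothetical edge cases; by the stated rule neither belongs in 06\19\09.
$edgeCases = @(
    'P:\PERSON\06\19\09\60.5577930.DOC'      # rule gives 06, not 06\19\09
    'P:\PERSON\06\19\09\61909.5577930.DOC'   # rule gives 61\90
)

# Record whether the regex accepts each path.
$edgeResults = foreach ($path in $edgeCases) {
    [pscustomobject]@{ Path = $path; Accepted = ($path -match $regex) }
}
$edgeResults | Format-Table -AutoSize
```

If paths like these can occur, a second check outside the regex may be needed to catch them.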
Upvotes: 0
Reputation: 71568
I tried to come up with something. Since PowerShell uses the .NET regex engine, which supports backreferences, you could try this:
^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(?:0(\d)|(\d{2}))\\(\d{2})\\(\d{2})\\(?:(?:\1|\2)\3\4)0?\.\d+\.\w{3,4}$
The \1 to \4 refer back to the capture groups matched earlier in the pattern.
I tested some strings on regex101.
The only thing is that it will accept P:\OPPORTUN\61\90\90\619090.5577930.DOC as well. I'm not sure how to get around this with a single regex without making it even longer than it already is (probably a bit more than twice as long).
It's about twice as long if you want to really make sure:
^(?:P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\0(\d)\\(\d{2})\\(\d{2})\\(?:\1\2\3)0\.\d+\.\w{3,4}|P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(\d{2})\\(\d{2})\\(\d{2})\\(?:\4\5\6)\.\d+\.\w{3,4})$
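With a top-level alternation like this, the anchors only cover both branches when the alternation sits inside a group. A minimal sketch of the difference, using simplified made-up patterns and a made-up path rather than the full regex:

```powershell
# Hypothetical, simplified patterns: only the grouping differs.
$ungrouped = '^P:\\EVENT\\\d+|P:\\PERSON\\\d+$'
$grouped   = '^(?:P:\\EVENT\\\d+|P:\\PERSON\\\d+)$'

$path = 'P:\EVENT\12\unexpected.txt'   # made-up path with trailing junk

$path -match $ungrouped   # True: the first branch has no end anchor
$path -match $grouped     # False: both anchors now apply to both branches
```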
EDIT: Up to 6 pairs of digits:
^P:\\(?:EVENT|OPPORTUN|ORGANISA|PERSON)\\(?:0(\d)|(\d\d))\\(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(\d\d)\\)?(?:(?:\1|\2)\3?\4?\5?\6?\7?)0?\.\d+\.\w{3,4}$
Upvotes: 2