Dhiwakar Ravikumar

Reputation: 2207

Regex With Lookahead For Fixed Length String

import re

strings = [
    r"C:\Photos\Selfies\1|",
    r"C:\HDPhotos\Landscapes\2|",
    r"C:\Filters\Pics\12345678|",
    r"C:\Filters\Pics2\00000000|",
    r"C:\Filters\Pics2\00000000|XAV7"
    ]
    
for string in strings:
    matchptrn = re.match(r"(?P<file_path>.*)(?!\d{8})", string)
    if matchptrn:
        print("FILE PATH = "+matchptrn.group('file_path'))

I am trying to get this regular expression with a lookahead to work the way I thought it would. Examples of lookaheads on most websites seem to be pretty basic string matches, e.g. not matching 'bar' if it is preceded by 'foo' as an example of a negative lookbehind.

My goal is to capture the actual file path in the group file_path only if the string does NOT have an 8-digit number just before the pipe symbol |, and to match anything after the pipe symbol in another group (something I haven't implemented here).

So in the above example it should match only the first two strings

C:\Photos\Selfies\1
C:\HDPhotos\Landscapes\2

In case of the last string

C:\Filters\Pics2\00000000|XAV7

I'd like to match C:\Filters\Pics2\00000000 in <file_path> and match XAV7 in another named group.
(This is something I can figure out on my own if I get some help with the negative look ahead)

Currently <file_path> matches everything, which makes sense since .* is greedy. I want it to capture only if the last part of the string before the pipe symbol is NOT an 8-digit number.

OUTPUT OF CODE SNIPPET PASTED BELOW

FILE PATH = C:\Photos\Selfies\1|
FILE PATH = C:\HDPhotos\Landscapes\2|
FILE PATH = C:\Filters\Pics\12345678|
FILE PATH = C:\Filters\Pics2\00000000|
FILE PATH = C:\Filters\Pics2\00000000|XAV7

Modifying the pattern by adding \\ before the lookahead

matchptrn = re.match(r"(?P<file_path>.*)\\(?!\d{8})", string)
if matchptrn:
    print("FILE PATH = "+matchptrn.group('file_path'))

makes things worse as the output is

FILE PATH = C:\Photos\Selfies
FILE PATH = C:\HDPhotos\Landscapes
FILE PATH = C:\Filters
FILE PATH = C:\Filters
FILE PATH = C:\Filters

Can someone please explain this as well?

Upvotes: 2

Views: 162

Answers (1)

Wiktor Stribiżew

Reputation: 626845

You can use

^(?!.*\\\d{8}\|$)(?P<file_path>.*)\|(?P<suffix>.*)

See the regex demo.

Details

  • ^ - start of a string
  • (?!.*\\\d{8}\|$) - fail the match if the string contains \ followed by eight digits and then | at the end of the string
  • (?P<file_path>.*) - Group "file_path": any zero or more chars other than line break chars, as many as possible
  • \| - a pipe
  • (?P<suffix>.*) - Group "suffix": the rest of the string, any zero or more chars other than line break chars, as many as possible.

See the Python demo:

import re
strings = [
    r"C:\Photos\Selfies\1|",
    r"C:\HDPhotos\Landscapes\2|",
    r"C:\Filters\Pics\12345678|",
    r"C:\Filters\Pics2\00000000|",
    r"C:\Filters\Pics2\00000000|XAV7"
    ]
    
for string in strings:
    matchptrn = re.match(r"(?!.*\\\d{8}\|$)(?P<file_path>.*)\|(?P<suffix>.*)", string)
    if matchptrn:
        print("FILE PATH = {}, SUFFIX = {}".format(*matchptrn.groups()))

Output:

FILE PATH = C:\Photos\Selfies\1, SUFFIX = 
FILE PATH = C:\HDPhotos\Landscapes\2, SUFFIX = 
FILE PATH = C:\Filters\Pics2\00000000, SUFFIX = XAV7
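
As for why the two original attempts behave the way they do, here is a minimal sketch using one of the sample strings. The variable names (sample, attempt1, attempt2) are only illustrative. In the first pattern, the greedy .* consumes the whole string, so the negative lookahead is only tested at the very end, where no digits can follow, and it always succeeds. In the second pattern, the regex engine backtracks from the end until it finds a backslash that is not followed by eight digits, which is why the captured path gets cut short.

```python
import re

sample = r"C:\Filters\Pics\12345678|"

# Attempt 1: .* grabs everything first, then (?!\d{8}) is checked at the
# end of the string. Nothing follows, so the lookahead trivially passes.
attempt1 = re.match(r"(?P<file_path>.*)(?!\d{8})", sample)
print("ATTEMPT 1 = " + attempt1.group("file_path"))

# Attempt 2: .* backtracks to the last backslash. That one is followed by
# 12345678, so the lookahead fails there and the engine keeps backtracking
# to the previous backslash (the one after "Filters"), where it succeeds.
attempt2 = re.match(r"(?P<file_path>.*)\\(?!\d{8})", sample)
print("ATTEMPT 2 = " + attempt2.group("file_path"))
```

In both cases the lookahead is applied at whatever position the engine reaches after .*, not "just before the pipe". Anchoring the check at the start of the string, as in the pattern above, avoids depending on where .* stops.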

Upvotes: 1
