
Reputation: 119

Multiple regex or conditions using groups

I'm trying to match several different filename syntaxes with one regex. In other words, I'm trying to match filename strings that contain the same components in different orders. The problem is that I don't know how to string together "OR" (|) cases when groups are involved.

Group syntax:

Product names consist of letters and digits, with optional "-", "_" or spaces between characters. "-", "_" and spaces never appear at the beginning or end of a product name.
PAF or PA always has a leading "-", followed by a trailing "-" and then a number.
Revision codes are "FG", "RD", "X", "A", or "\d+"; all except the last are followed directly by a number.
The sheet number is case-insensitive (hence re.IGNORECASE): it is preceded by "-", a space, or nothing, then the word "sheet", followed by "-", a space, or nothing, and then a number.

The filenames follow these patterns:

(Product Name)-(PAF/PA-#) (Sheet #)-(Revision)
(\w(?:\w*(?:-|\s|_)?\w+)*)(-PA(?:F|)-\d+)(?:(?:\s|-)sheet(?:\s|-)\d+)(-(?:FG|RD|X|A|)\d+)
(Product Name)-(PAF/PA-#)-(Revision) (Sheet #)
(\w(?:\w*(?:-|\s|_)?\w+)*)(-PA(?:F|)-\d+)(-(?:FG|RD|X|A|)\d+)(?:(?:\s|-)sheet(?:\s|-)\d+)
(Product Name)-(PAF/PA-#)-(Revision)
(\w(?:\w*(?:-|\s|_)?\w+)*)(-PA(?:F|)-\d+)(-(?:FG|RD|X|A|)\d+)
(Product Name)-(Revision) (Sheet #)
(\w(?:\w*(?:-|\s|_)?\w+)*)(-(?:FG|RD|X|A|)\d+)(?:(?:\s|-)sheet(?:\s|-)\d+)
(Product Name)-(Revision)
(\w(?:\w*(?:-|\s|_)?\w+)*)(-(?:FG|RD|X|A|)\d+)

PAF and PA are product type denotations, the sheet number is useless information, and FG#, RD#, X#, A#, or # are all product revisions. I need the product name, denotation, and revision each in its own group.

^(\w(?:\w*(?:-|\s|_)?\w+)*)
(?:
(-(?:FG|RD|X|A|)\d+)|
(-PA(?:F|)-\d+)(-(?:FG|RD|X|A|)\d+)|
(-PA(?:F|)-\d+)(?:(?:\s|-)sheet(?:\s|-)\d+)|
(-PA(?:F|)-\d+)(?:(?:\s|-)sheet(?:\s|-)\d+)(-(?:FG|RD|X|A|)\d+)
)
(?:.*)?$

I've tried the regex above, but it doesn't work correctly. First, it returns too many groups; I only want 3.

pattern = re.compile(r'''^(\w(?:\w*(?:-|\s|_)?\w+)*)        # match any alphanumeric and dashes without leading or trailing dashes
                         (-PA(?:F|)-\d+)                    # match '-PAF-<number>' or '-PA-<number>'
                         (?:(?:\s|-|)?sheet(?:\s|-|)?\d+)?  # match '?sheet?<number>' where ? can be <space> or '-'
                         (-(?:FG|RD|X|A|)\d+)?              # match '-FG<number>', '-RD<number>', '-X<number>', '-A<number>' or <number>
                         (?:.*)?$''', flags=re.IGNORECASE|re.VERBOSE)

The filename patterns described above should all match the regex.

Upvotes: 0

Answers (2)

DBFS

Reputation: 119

After a lot of practice with regex, and with help from GDN, I found the solution:

(.*?)(?:(?:\s|_|-|\s?-\s|\s-\s?)(?=(?:PAF-|PA-|FG|RD|X|A)\d+))((?:PAF|PA)-\d+)?(?:\s|_|-|\s?-\s|\s-\s?)?(?:.*?)?((?:FG|RD|X|A)\d+)

import re


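def input_loop(pattern, doc_type):
    """Prompt for filenames until "x" is entered and print the captured groups."""
    while True:
        # input() is the Python 3 replacement for Python 2's raw_input()
        filename = input('Enter {}, Enter "x" to Close: '.format(doc_type))
        if filename == 'x':
            break
        matches = pattern.match(filename)
        if matches:
            # Tuple of (product name, PAF/PA-# or None, revision)
            groups = matches.groups()
            print(groups)
        else:
            print('''Couldn't match string: "{}"'''.format(filename))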

pattern = re.compile(r'''
  (.*?)(?# Match product name)
  (?:(?:\s|_|-|\s?-\s|\s-\s?)(?=(?:PAF-|PA-|FG|RD|X|A)\d+))(?# Match spacer after product name)
  ((?:PAF|PA)-\d+)?(?# Match optional PAF-# or PA-#)
  (?:\s|_|-|\s?-\s|\s-\s?)?(?# Match spacer after product type name)
  (?:.*?)?(?# Match useless data)
  ((?:FG|RD|X|A)\d+)''', flags=re.IGNORECASE|re.VERBOSE)
input_loop(pattern, 'Assembly Drawings')
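
For a quick non-interactive check, the one-line pattern above can be run over a few of the sample filenames from this thread. This is only a sketch: the `extract` helper and the `SAMPLES` list are names made up for illustration, and the pattern is the one-line form given above, split across two string literals but otherwise unchanged.

```python
import re

# One-line form of the pattern from this answer, split over adjacent literals.
PATTERN = re.compile(
    r'(.*?)(?:(?:\s|_|-|\s?-\s|\s-\s?)(?=(?:PAF-|PA-|FG|RD|X|A)\d+))'
    r'((?:PAF|PA)-\d+)?(?:\s|_|-|\s?-\s|\s-\s?)?(?:.*?)?((?:FG|RD|X|A)\d+)',
    flags=re.IGNORECASE)

# Hypothetical sample list, taken from the filenames quoted in this thread.
SAMPLES = [
    '10G-HUB-PAF-1 Sheet 1-FG0',
    'HUB-DISP-SPCR-RD0',
    '2400PSUA-8-PA-1-Sheet1-X0',
    'XXXX-XXX-123-PAF-1-FG0 Sheet 1',
]


def extract(filename):
    """Return (product name, denotation, revision), or None if nothing matches."""
    m = PATTERN.match(filename)
    return m.groups() if m else None


for name in SAMPLES:
    print(name, '->', extract(name))
```

Filenames without a PAF/PA part leave the second element as None, so the result always has exactly three entries.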

Upvotes: 0

GDN

Reputation: 312

EDIT after new examples

To avoid overcomplicating the regex pattern (which is complicated enough already), I would get rid of the "sheet" part first.

So first, remove the "sheet #" pattern from the filename before applying the match.

This will reduce your cases to these patterns only:

(Product Name)-(PAF/PA-#)-(Revision)
(Product Name)-(Revision)
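
Since these two shapes differ only in whether the PAF/PA part is present, a single optional group can stand in for an "OR" of whole alternatives, which keeps the group count at three. A minimal sketch of that idea, using deliberately simplified, hypothetical sub-patterns (`\w+` for the name, only `X#` revisions) rather than the real ones:

```python
import re

# Alternation: every parenthesised group gets its own number,
# so the two branches together contribute three groups after the name.
ALTERNATED = re.compile(r'^(\w+)(?:-(PAF?-\d+)-(X\d+)|-(X\d+))$')

# Optional middle part: the same two shapes, but always three groups.
OPTIONAL = re.compile(r'^(\w+)(?:-(PAF?-\d+))?-(X\d+)$')

print(ALTERNATED.groups)                          # number of groups: 4
print(OPTIONAL.groups)                            # number of groups: 3
print(ALTERNATED.match('HUB-X0').groups())        # ('HUB', None, None, 'X0')
print(OPTIONAL.match('HUB-X0').groups())          # ('HUB', None, 'X0')
print(OPTIONAL.match('HUB-PA-1-X0').groups())     # ('HUB', 'PA-1', 'X0')
```

With the alternated form, the revision lands in a different group depending on which branch matched; with the optional form it is always the last group.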

Then apply the regex to split the name into three groups. For the revision group I used (?!...), a negative lookahead assertion, to handle cases like this: "2400PSUA-8-PA-1-X0"

Here's the revised code:

import re

string = """10G-HUB-PAF-1 Sheet 1-FG0
HUB-DISP-SPCR-RD0
HUB-MAIN-PA-1-FG0
2400ODU-PA-1-X0 Sheet 1
2400PSUA-8-PA-1-Sheet1-X0
2405OE-PAF-1-FG0
2400PSUA-8-PA-1-Sheet1-X0
XXXX-XXX-123-PAF-1-FG0 Sheet 1
"""

regex = r'(?# product name )(.*?)' + \
        r'(?# PA|PAF       )(?:(?:-)(?:(PAF-\d|PA-\d).*))?' + \
        r'(?# Revision     )(?:-)((?:\d)(?!.*(?:FG|RD|X|A)\d)|(?:(?:FG|RD|X|A)\d))'

pattern = re.compile(regex, flags=re.IGNORECASE|re.VERBOSE)

for s in string.splitlines():
    print('String %s' % s)
    # Remove 'Sheet#' or 'Sheet #' or '-Sheet #' or '-Sheet#'
    s = re.sub(r'-?sheet\s?\d', '', s, flags=re.IGNORECASE)
    print('Purged string: %s' % s)
    f = pattern.match(s)
    if f is None:
        # Skip names that fit neither of the two remaining patterns
        print('No match\n')
        continue
    print('group1: %s' % f.group(1))
    print('group2: %s' % f.group(2))
    print('group3: %s\n' % f.group(3))

With some output:

String 10G-HUB-PAF-1 Sheet 1-FG0
Purged string: 10G-HUB-PAF-1 -FG0
group1: 10G-HUB
group2: PAF-1
group3: FG0

String HUB-DISP-SPCR-RD0
Purged string: HUB-DISP-SPCR-RD0
group1: HUB-DISP-SPCR
group2: None
group3: RD0

...omitted output ...

String 2400PSUA-8-PA-1-Sheet1-X0
Purged string: 2400PSUA-8-PA-1-X0
group1: 2400PSUA-8
group2: PA-1
group3: X0

String XXXX-XXX-123-PAF-1-FG0 Sheet 1
Purged string: XXXX-XXX-123-PAF-1-FG0 
group1: XXXX-XXX-123
group2: PAF-1
group3: FG0
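
The effect of the negative lookahead on the revision group can be seen in isolation. The sketch below is a simplified variant, not the exact pattern used above: it searches for the revision only, allows multi-digit numbers, and the names `PLAIN`, `GUARDED` and `find_revision` are hypothetical. A bare number is accepted as the revision only when no lettered revision (FG#, RD#, X#, A#) appears later in the string.

```python
import re

# Without the lookahead, the first "-<digits>" is taken as the revision.
PLAIN = re.compile(r'-(\d+|(?:FG|RD|X|A)\d+)', re.IGNORECASE)

# With the lookahead, a bare number is rejected if a lettered revision follows.
GUARDED = re.compile(r'-(\d+(?!.*(?:FG|RD|X|A)\d)|(?:FG|RD|X|A)\d+)',
                     re.IGNORECASE)


def find_revision(pattern, name):
    """Return the first revision the pattern finds in name, or None."""
    m = pattern.search(name)
    return m.group(1) if m else None


print(find_revision(PLAIN, '2400PSUA-8-PA-1-X0'))    # 8  (part of the name)
print(find_revision(GUARDED, '2400PSUA-8-PA-1-X0'))  # X0
print(find_revision(GUARDED, 'HUB-DISP-7'))          # 7  (numeric revision)
```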

Upvotes: 1
