xph

Reputation: 997

Python regex format

I'm trying to match some strings using Python's re module, but I can't get it right. The strings I have to deal with look like this (example):

XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
asdf_1980_2234a_2
XY_abcd_5098_2270_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH

The pattern is not constant; it can vary to some degree. This is important to me:

So, according to the previous example, this is what I need to work with:

('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')

...I've never done such if-and-if-not things in regex, and it's driving me crazy. Here is what I've got so far, but it's not exactly what I need:

pattern     = re.compile(
              r"""
              ([0-9]{4}
              [A-Z]{0,3})
              [_-]{1,3}
              ([0-9]{2,4}
              [0-9A-Z_-]{0,16})
              """,
              re.IGNORECASE | 
              re.VERBOSE
              )

if re.search(pattern, string):
    print re.findall(pattern, string)

When I use this on the last example above, this is what I get:

[(u'7659Ae', u'1450sp_rev_2_1_NC_GR')]

...almost what I need, but I don't know how to exclude the _NC_GR at the end, and simply limiting the number of characters is not a good approach.

Does anyone have a nice and working solution to this case?

Upvotes: 4

Views: 1283

Answers (2)

eyquem

Reputation: 27575

Martijn's solution doesn't work for me, so here is mine.

Note that I don't use re.IGNORECASE.
Hence, my regex also captures the end of
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof
I don't know if that is really what you want in this case.

inputtext = """XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
asdf_1980_2234a_2
XY_abcd_5098_2270_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof"""
print inputtext


import re

print """\n----------------------------------------
WANTED
('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')"""
print '----------- eyquem ----------------------'
ri = re.compile(r'^\D+'
                r'(\d{4}[a-zA-Z]{0,3})'
                r'[_-]+'
                r'(.+?)'
                r'(?:[_-]+NC.*)?$',
                re.MULTILINE)

for match in ri.findall(inputtext):
    print match
    
print '----------- Martijn ----------------------'
ro     = re.compile(
              r"""
              ([0-9]{4}
              [A-Z]{0,3})
              [_-]{1,3}
              ([0-9]{2,4}
              [0-9A-Z_-]{0,16}?)
              (?:[-_]NC)?
              """,
              re.IGNORECASE | re.VERBOSE)

for match in ro.findall(inputtext):
    print match

result

----------------------------------------
WANTED
('1234',   '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e',  '50')
('1980',   '2234a_2')
('5098',   '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
----------- eyquem ----------------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('1980', '2234a_2')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_nc_woof')
----------- Martijn ----------------------
('1234', '0040')
('1122Ae', '1150')
('0124e', '50')
('1980', '2234')
('5098', '2270')
('7659Ae', '1450')
('7659Ae', '1450')

My regex can also be used on individual lines:

for s in inputtext.splitlines(True):
    print ri.match(s).groups()

This gives the same result.
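One caveat with the per-line loop: ri.match(s) returns None for a line that does not fit the pattern, so calling .groups() on it directly would raise AttributeError. A minimal sketch of a guarded version follows; the helper name extract is my own, not something from the question, and the print calls are written so they behave the same on Python 2 and 3.

```python
import re

# Same pattern as `ri` above, written with raw strings.
ri = re.compile(r'^\D+'
                r'(\d{4}[a-zA-Z]{0,3})'
                r'[_-]+'
                r'(.+?)'
                r'(?:[_-]+NC.*)?$',
                re.MULTILINE)

def extract(line):
    """Return the two captured groups, or None when the line does not match.

    `extract` is a hypothetical helper name used only for this sketch.
    """
    m = ri.match(line)
    return m.groups() if m else None

print(extract('XY_efgh_0124e_50_NC'))   # ('0124e', '50')
print(extract('no digits here'))        # None
```

Returning None keeps the decision about malformed lines (skip, log, raise) with the caller.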


EDIT

import re

inputtext = """XY_efgh_1234_0040_rev_2_1_NC_asdf
XY_abcd_1122Ae_1150_rev2_1_NC
XY_efgh_0124e_50_NC
XY_efgh_0228e_66-__NC
asdf_1980_2234a_2   
asdf_2999_133a
XY_abcd_5098_2270_2_1_NC
XY_abcd_6099_33370_2_1_NC
XY_abcd_6099_3370abcd_2_1_NC
PC_bos_7659Ae_1450sp_rev_2_1_NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1___NC_GRAPH
PC_bos_7659Ae_1450sp_rev_2_1_nc_woof_NC
PC_bos_7659Ae_1450sp_rev_2_1_anc_woof_NC
PC_bos_7659Ae_1450sp_rev_2_1_abNC_woof_NC"""

print '----------- Martijn 2 ------------'
ruu     = re.compile(r"""
              ( [0-9]{4} [A-Z]{0,3} )
              [_-]{1,3}
              ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
              """, re.IGNORECASE | re.VERBOSE)
for match in ruu.findall(inputtext):
    print match
print '----------- eyquem 2 ------------'
rii = re.compile(r'[_-]'
                r'(\d{4}[A-Z]{0,3})'
                r'[_-]{1,3}'
                r'('
                  r'(?=\d{2,4}[A-Z]{0,3}(?![\dA-Z]))'
                  r'(?:[0-9A-Z_-]+?)'
                 r')'
                r'(?:[-_]+NC.*)?'
                r'(?![0-9A-Z_-])',
                re.IGNORECASE)
for m in rii.findall(inputtext):
    print m

result

----------- Martijn 2 ------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('0228e', '66-_')
('1980', '2234a_2')
('2999', '133a')
('5098', '2270_2_1')
('6099', '33370_2_1')
('6099', '3370abcd_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1__')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_')
('7659Ae', '1450sp_rev_2_1_a')
----------- eyquem 2 ------------
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('0228e', '66')
('1980', '2234a_2')
('2999', '133a')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1')
('7659Ae', '1450sp_rev_2_1_anc_woof')
('7659Ae', '1450sp_rev_2_1_abNC_woof')

Remarks:

  • my regex doesn't catch '33370_2_1' or '3370abcd_2_1' because they don't respect the pattern "2 to 4 digits possibly followed by at most 3 letters",
    whereas Martijn's solution catches them

  • the ends of the portions caught by my regex are clean; in Martijn's code they aren't

  • Martijn's regex stops in front of every sequence NC or nc, even if it isn't preceded by an underscore, that is to say even when these letters are part of the wanted portion.
    My regex stops only at an NC that is preceded by an underscore or dash. If this behaviour isn't desired, let me know and I will modify it
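The first remark depends entirely on the lookahead at the start of the second group. Here is a small sketch that isolates it; the names shape and has_valid_shape are mine and exist only for illustration.

```python
import re

# The lookahead from `rii`, isolated: 2 to 4 digits, at most 3 letters,
# and then no further digit or letter.
shape = re.compile(r'(?=\d{2,4}[A-Z]{0,3}(?![\dA-Z]))', re.IGNORECASE)

def has_valid_shape(candidate):
    """True when the candidate starts with the digits-then-letters shape."""
    return bool(shape.match(candidate))

for candidate in ('2270_2_1', '1450sp_rev_2_1', '33370_2_1', '3370abcd_2_1'):
    print('%-15s %s' % (candidate, has_valid_shape(candidate)))
```

The first two candidates pass. '33370_2_1' fails because a fifth digit follows the four allowed ones, and '3370abcd_2_1' fails because a fourth letter follows the three allowed ones.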

Upvotes: 2

Martijn Pieters

Reputation: 1121484

You need to use a negative lookahead to match characters that are not followed by NC. Reformatting your regular expression a little to show off the groupings:

pattern     = re.compile(r"""
              ( [0-9]{4} [A-Z]{0,3} )
              [_-]{1,3}
              ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
              """, re.IGNORECASE | re.VERBOSE)

With the {0,16} replaced by an unbounded * quantifier, this results in:

>>> for match in pattern.findall(inputtext):
...     print match
... 
('1234', '0040_rev_2_1')
('1122Ae', '1150_rev2_1')
('0124e', '50')
('1980', '2234a_2')
('5098', '2270_2_1')
('7659Ae', '1450sp_rev_2_1')

So the (non-capturing) group (?:[0-9A-Z_-](?!NC)) matches any digit, letter, underscore or dash that is not followed by the characters NC.
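Because the lookahead is tested after each character is consumed and does not require a separator, the match stops one character before any NC (or nc, given re.IGNORECASE), including one inside a word. The sketch below shows this on one sample string. It then compares a variation, which is my own assumption and not part of the answer above: it tests the lookahead before consuming each character and requires at least one underscore or dash in front of NC.

```python
import re

# The pattern from the answer, unchanged.
pattern = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )
    [_-]{1,3}
    ( [0-9]{2,4} (?:[0-9A-Z_-](?!NC))* )
    """, re.IGNORECASE | re.VERBOSE)

# Hypothetical variation: check the lookahead *before* consuming each
# character, and require a separator run in front of NC.
variant = re.compile(r"""
    ( [0-9]{4} [A-Z]{0,3} )
    [_-]{1,3}
    ( [0-9]{2,4} (?:(?![_-]+NC)[0-9A-Z_-])* )
    """, re.IGNORECASE | re.VERBOSE)

sample = 'PC_bos_7659Ae_1450sp_rev_2_1_anc_woof_NC'
print(pattern.findall(sample))   # [('7659Ae', '1450sp_rev_2_1_')]
print(variant.findall(sample))   # [('7659Ae', '1450sp_rev_2_1_anc_woof')]
```

Whether "anc" should end the wanted portion depends on the data, so treat the variation as one option rather than a correction.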

Upvotes: 3
