scdmb

Reputation: 15621

Regular expression for repeating sequence

I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed) separated by commas (the last group is not followed by a comma).

Examples:

abc,bca,cbb
ccc,abc,aab,baa
bcb

I have written the following regular expression:

re.match('([abc][abc][abc],)+', "abc,defx,df")

However, it doesn't work correctly, because for the above example:

>>> print(bool(re.match('([abc][abc][abc],)+', "abc,defx,df"))) # defx in second group
True
>>> print(bool(re.match('([abc][abc][abc],)+', "axc,defx,df"))) # 'x' in first group
False

It seems to check only the first group of three letters and to ignore the rest. How do I write this regular expression correctly?

Upvotes: 8

Views: 20248

Answers (8)

Wiktor Stribiżew

Reputation: 626689

To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).

For example:

  • (?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
  • (?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.

Here, you can use

^[abc]{3}(?:,[abc]{3})*$
          ^^

Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See, for example, the classic issue described in the post "re.findall behaves weird": re.findall, and every other regex method that uses it behind the scenes, returns only the captured substrings if there is a capturing group in the pattern.
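As a small illustration of that difference (the sample string is taken from the question; the variable names are made up for this sketch), compare what re.findall returns for the capturing and the non-capturing form of the same pattern:

```python
import re

text = "abc,bca,cbb"

# With a capturing group, findall returns only what the group captured
# in its last repetition, not the whole match.
capturing = re.findall(r'([abc]{3},)+', text)

# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r'(?:[abc]{3},)+', text)

print(capturing)      # ['bca,']
print(non_capturing)  # ['abc,bca,']
```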

In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that the pattern has match groups ("To actually get the groups, use str.extract."), and Series.str.extract, Series.str.extractall and Series.str.findall will behave like re.findall.

Upvotes: 1

Karl Knechtel

Reputation: 61478

The obligatory "you don't need a regex" solution:

all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
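Wrapped in a function so it can be tried on the strings from the question, this might look as follows (the name is_valid_no_regex is made up for this sketch):

```python
def is_valid_no_regex(data):
    # Every character must be a, b, c or a comma ...
    only_allowed = all(letter in 'abc,' for letter in data)
    # ... and every comma-separated item must have exactly three characters.
    right_lengths = all(len(item) == 3 for item in data.split(','))
    return only_allowed and right_lengths

print(is_valid_no_regex("abc,bca,cbb"))  # True
print(is_valid_no_regex("abc,defx,df"))  # False
print(is_valid_no_regex("abc,"))         # False: the trailing comma leaves an empty item
```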

Upvotes: 3

Sonya

Reputation: 965

What you're asking it to find with your regex is "at least one triple of letters a, b, c" - that's what "+" gives you. Whatever follows after that doesn't really matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line must all consist of allowed triples. However in the current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it's not so. Try this:

re.match('([abc][abc][abc],)*([abc][abc][abc])$', s)

This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.

Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
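To see what the original pattern actually matched, and how the version with "$" behaves, here is a quick check on the strings from the question (the variable names are only for illustration):

```python
import re

original = '([abc][abc][abc],)+'
fixed = '([abc][abc][abc],)*([abc][abc][abc])$'

s = "abc,defx,df"

# The original pattern is satisfied by the valid prefix alone.
prefix = re.match(original, s).group(0)
print(prefix)  # abc,

# The fixed pattern must reach the end of the string.
print(bool(re.match(fixed, s)))              # False
print(bool(re.match(fixed, "abc,bca,cbb")))  # True
print(bool(re.match(fixed, "bcb")))          # True
```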

Upvotes: 4

eyquem

Reputation: 27575

If your aim is to validate that a string is composed of triplets of the letters a, b and c:

import re

for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print(ss, '   ', bool(re.match(r'([abc]{3},?)+\Z', ss)))

result

abc,bbc,abb,baa,bbb     True
acc     True
abc,bbc,abb,bXa,bbb     False
abc,bbc,ab,baa,bbb     False

\Z means the end of the string. Its presence forces the match to extend to the very end of the string.

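By the way, I like Sonya's form too. In a way it is clearer, and it is also stricter: with the optional ,? above, strings such as abcabc or abc, would pass as well.

bool(re.match(r'([abc]{3},)*[abc]{3}\Z', ss))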
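One practical difference between \Z and $ shows up when the input ends with a newline: without re.MULTILINE, $ also matches just before a trailing newline, while \Z matches only at the very end. A short sketch (the test string is an assumption, not from the question):

```python
import re

line = "abc,bca\n"  # e.g. a line read from a file, newline not stripped

with_dollar = re.match(r'([abc]{3},)*[abc]{3}$', line)
with_z = re.match(r'([abc]{3},)*[abc]{3}\Z', line)

print(bool(with_dollar))  # True: $ tolerates the trailing newline
print(bool(with_z))       # False: \Z does not
```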

Upvotes: 0

sateesh

Reputation: 28663

An alternative without using regex (albeit a brute force way):

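>>> import itertools
>>> def matcher(x):
...     # all 27 three-letter strings over the alphabet a, b, c
...     total = ["".join(p) for p in itertools.product(('a','b','c'),repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True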

>>> matcher("abc,bca,aaa")
    True
>>> matcher("abc,bca,xyz")
    False
>>> matcher("abc,aaa,bb")
    False

Upvotes: 0

scessor

Reputation: 16115

Try the following regex:

^[abc]{3}(,[abc]{3})*$

^...$   matches from the start to the end of the string
[...]   one of the listed characters
...{3}  exactly three repetitions of the preceding element
(...)*  zero or more repetitions of the group in parentheses
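A minimal way to try this pattern on the strings from the question (PATTERN and is_valid are names chosen for this sketch):

```python
import re

PATTERN = re.compile(r'^[abc]{3}(,[abc]{3})*$')

def is_valid(s):
    # True when the whole string is a comma-separated list of abc-triples
    return PATTERN.match(s) is not None

for s in ["abc,bca,cbb", "ccc,abc,aab,baa", "bcb", "abc,defx,df", "axc,defx,df"]:
    print(s, is_valid(s))
```

The first three strings should be accepted and the last two rejected.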

Upvotes: 18

lig

Reputation: 3890

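You need to iterate over the sequence of found values.

import re

data_string = "abc,bca,df"

imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)

for match in imatch:
    print(match.group('value'))

So the regex to check whether the whole string matches the pattern needs an anchor at the end as well; otherwise re.match is satisfied by the valid prefix abc,bca, alone:

data_string = "abc,bca,df"

# (?=[abc]) rejects a trailing comma; the final $ rejects leftovers such as "df"
match = re.match(r'^([abc]{3}(,(?=[abc])|$))+$', data_string)

if match:
    print("data string is correct")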

Upvotes: 2

silvado

Reputation: 18147

Your result is not surprising since the regular expression

([abc][abc][abc],)+

matches three characters of [abc] followed by a comma, one or more times, and with re.match only at the start of the string; whatever follows is ignored. So the most important part is to make sure that there is nothing more in the string, as scessor suggests by adding ^ (start of string) and $ (end of string) to the regular expression.
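In Python 3.4 and later, re.fullmatch is another way to require that nothing else is left in the string, without writing the anchors yourself. A short sketch (the function name fully_matches is made up here):

```python
import re

def fully_matches(s):
    # fullmatch succeeds only if the pattern covers the entire string
    return re.fullmatch(r'[abc]{3}(?:,[abc]{3})*', s) is not None

print(fully_matches("abc,bca,cbb"))  # True
print(fully_matches("abc,defx,df"))  # False
```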

Upvotes: 1
