Reputation: 1255

How can I use Regex to find a string of characters in alphabetical order using Python?

So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).

I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.

However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?

Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.

Upvotes: 12

Answers (6)

ctwheels

Reputation: 22817

Just to demonstrate why regex is not practical for this sort of thing, here is a regex that would match ghiijk in your given example of abcghiijkyxz. Note it'll also match abc, y, x, z since they should technically be considered for longest string of alphabetical characters in order. Unfortunately, you can't determine which is the longest with regex alone, but this does give you all the possibilities. Please note that this regex works for PCRE and will not work with python's re module! Also, note that python's regex library does not currently support (*ACCEPT). Although I haven't tested, the pyre2 package (python wrapper for Google's re2 pyre2 using Cython) claims it supports the (*ACCEPT) control verb, so this may currently be possible using python.

See regex in use here

((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)

Results in:

abc
ghiijk
y
x
z

Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT)):

a+ Matches a (literally) one or more times. This catches instances where several of the same characters are in sequence such as aa.
(?(?!b)(*ACCEPT)) If clause evaluating the condition.
- (?!b) Condition for the if clause. Negative lookahead ensuring what follows is not b. This is because if it's not b, we want the following control verb to take effect.
- (*ACCEPT) If the condition (above) is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group is ended successfully at that particular location, while the parent pattern continues to execute.

So what happens if the condition is not met? Well, that means that (?!b) evaluated to false. This means that the following character is, in fact, b and so we allow the matching (rather capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+ which allows us to match consecutive options until the (*ACCEPT) control verb or end of line is met.

The only exception to this whole regular expression is that of z. Being that it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after, so we can simply put z+(?(?!$)(*ACCEPT)), which will ensure nothing matches after z. If you, instead, want to match za (circular alphabetical order matching - idk if this is the proper terminology, but it sounds right to me) you can use z+(?(?!a)(*ACCEPT)))+ as seen here.

Upvotes: 10

AndreyS Scherbakov

Reputation: 2788

It's really easy with regexps!

(Using trailing contexts here)

rexp=re.compile(
    "".join(['(?:(?=.' + chr(ord(x)+1) + ')'+ x +')?'
            for x in "abcdefghijklmnopqrstuvwxyz"])
    +'[a-z]')

a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'

re.findall(rexp, a)

#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']

Upvotes: 0

Waxrat

Reputation: 2185

Generate all the regex substrings like ^a+b+c+$ (longest to shortest). Then match each of those regexs against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.

def all_substrings(s):
    n = len(s)
    for i in xrange(n, 0, -1):
        for j in xrange(n - i + 1):
            yield s[j:j + i]

def longest_alphabetical_substring(s):
    for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
        r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
        for u in all_substrings(s):
            if r.match(u):
                return u

print longest_alphabetical_substring("abcghiijkyxz")

That prints "ghiijk".

Upvotes: 2

Jan

Reputation: 43169

To actually "solve" the problem, you could use

string = 'abcxyzghiijkl'

def sort_longest(string):
    stack = []; result = [];

    for idx, char in enumerate(string):
        c = ord(char)
        if idx == 0:
            # initialize our stack
            stack.append((char, c))
        elif idx == len(string) - 1:
            result.append(stack)
        elif c == stack[-1][1] or c == stack[-1][1] + 1:
            # compare it to the item before (a tuple)
            stack.append((char, c))
        else:
            # append the stack to the overall result
            # and reinitialize the stack
            result.append(stack)
            stack = []
            stack.append((char, c))

    return ["".join(item[0]
        for item in sublst) 
        for sublst in sorted(result, key=len, reverse=True)]

print(sort_longest(string))

Which yields

['ghiijk', 'abc', 'xyz']

in this example.

The idea is to loop over the string and keep track of a stack variable which is filled by your requirements using ord().

Upvotes: 0

Srdjan M.

Reputation: 3405

Regex: char+ meaning a+b+c+...

Details:

+ Matches between one and unlimited times

Python code:

import re

def LNDS(text):
    array = []

    for y in range(97, 122):  # a - z
        st = r"%s+" % chr(y)
        for x in range(y+1, 123):  # b - z
            st += r"%s+" % chr(x)
            match = re.findall(st, text)

            if match:
                array.append(max(match, key=len))
            else:
                break

        if array:
            array = [max(array, key=len)]

    return array

Output:

print(LNDS('abababababab abc')) >>> ['abc']
print(LNDS('abcghiijkyxz')) >>> ['ghiijk']

For string abcghiijkyxz regex pattern:

a+b+                    i+j+k+l+
a+b+c+                  j+k+
a+b+c+d+                j+k+l+
b+c+                    k+l+
b+c+d+                  l+m+
c+d+                    m+n+
d+e+                    n+o+
e+f+                    o+p+
f+g+                    p+q+
g+h+                    q+r+
g+h+i+                  r+s+
g+h+i+j+                s+t+
g+h+i+j+k+              t+u+
g+h+i+j+k+l+            u+v+
h+i+                    v+w+
h+i+j+                  w+x+
h+i+j+k+                x+y+
h+i+j+k+l+              y+z+
i+j+
i+j+k+

Code demo

Upvotes: 1

Imran

Reputation: 13458

As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:

def LNDS(s):
    start = 0
    cur_len = 1
    max_len = 1
    for i in range(1,len(s)):
        if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
            cur_len += 1
        else:
            if cur_len > max_len:
                max_len = cur_len
                start = i - cur_len
            cur_len = 1
    if cur_len > max_len:
        max_len = cur_len
        start = len(s) - cur_len
    return s[start:start+max_len]

>>> LNDS('abcghiijkyxz')
'ghiijk'

We keep a running total of how many non-decreasing characters we have seen, and when the non-decreasing sequence ends we compare it to the longest non-decreasing sequence we saw previously, updating our "best seen so far" if it is longer.

Upvotes: 2

How can I use Regex to find a string of characters in alphabetical order using Python?

Answers (6)

Related Questions