Reputation: 1255
So I have a challenge I'm working on - find the longest string of alphabetical characters in a string. For example, "abcghiijkyxz" should result in "ghiijk" (Yes the i is doubled).
I've been doing quite a bit with loops to solve the problem - iterating over the entire string, then for each character, starting a second loop using lower and ord. No help needed writing that loop.
However, it was suggested to me that Regex would be great for this sort of thing. My regex is weak (I know how to grab a static set, my look-forwards knowledge extends to knowing they exist). How would I write a Regex to look forward, and check future characters for being next in alphabetical order? Or is the suggestion to use Regex not practical for this type of thing?
Edit: The general consensus seems to be that Regex is indeed terrible for this type of thing.
Upvotes: 12
Views: 5209
Reputation: 22817
Just to demonstrate why regex is not practical for this sort of thing, here is a regex that would match ghiijk
in your given example of abcghiijkyxz
. Note it'll also match abc
, y
, x
, z
since they should technically be considered for longest string of alphabetical characters in order. Unfortunately, you can't determine which is the longest with regex alone, but this does give you all the possibilities. Please note that this regex works for PCRE and will not work with python's re
module! Also, note that python's regex
library does not currently support (*ACCEPT)
. Although I haven't tested, the pyre2 package (python wrapper for Google's re2 pyre2 using Cython) claims it supports the (*ACCEPT)
control verb, so this may currently be possible using python.
((?:a+(?(?!b)(*ACCEPT))|b+(?(?!c)(*ACCEPT))|c+(?(?!d)(*ACCEPT))|d+(?(?!e)(*ACCEPT))|e+(?(?!f)(*ACCEPT))|f+(?(?!g)(*ACCEPT))|g+(?(?!h)(*ACCEPT))|h+(?(?!i)(*ACCEPT))|i+(?(?!j)(*ACCEPT))|j+(?(?!k)(*ACCEPT))|k+(?(?!l)(*ACCEPT))|l+(?(?!m)(*ACCEPT))|m+(?(?!n)(*ACCEPT))|n+(?(?!o)(*ACCEPT))|o+(?(?!p)(*ACCEPT))|p+(?(?!q)(*ACCEPT))|q+(?(?!r)(*ACCEPT))|r+(?(?!s)(*ACCEPT))|s+(?(?!t)(*ACCEPT))|t+(?(?!u)(*ACCEPT))|u+(?(?!v)(*ACCEPT))|v+(?(?!w)(*ACCEPT))|w+(?(?!x)(*ACCEPT))|x+(?(?!y)(*ACCEPT))|y+(?(?!z)(*ACCEPT))|z+(?(?!$)(*ACCEPT)))+)
Results in:
abc
ghiijk
y
x
z
Explanation of a single option, i.e. a+(?(?!b)(*ACCEPT))
:
a+
Matches a
(literally) one or more times. This catches instances where several of the same characters are in sequence such as aa
.(?(?!b)(*ACCEPT))
If clause evaluating the condition.
(?!b)
Condition for the if clause. Negative lookahead ensuring what follows is not b
. This is because if it's not b
, we want the following control verb to take effect.(*ACCEPT)
If the condition (above) is met, we accept the current solution. This control verb causes the regex to end successfully, skipping the rest of the pattern. Since this token is inside a capturing group, only that capturing group is ended successfully at that particular location, while the parent pattern continues to execute.So what happens if the condition is not met? Well, that means that (?!b)
evaluated to false. This means that the following character is, in fact, b
and so we allow the matching (rather capturing in this instance) to continue. Note that the entire pattern is wrapped in (?:)+
which allows us to match consecutive options until the (*ACCEPT)
control verb or end of line is met.
The only exception to this whole regular expression is that of z
. Being that it's the last character in the English alphabet (which I presume is the target of this question), we don't care what comes after, so we can simply put z+(?(?!$)(*ACCEPT))
, which will ensure nothing matches after z
. If you, instead, want to match za
(circular alphabetical order matching - idk if this is the proper terminology, but it sounds right to me) you can use z+(?(?!a)(*ACCEPT)))+
as seen here.
Upvotes: 10
Reputation: 2788
It's really easy with regexps!
(Using trailing contexts here)
rexp=re.compile(
"".join(['(?:(?=.' + chr(ord(x)+1) + ')'+ x +')?'
for x in "abcdefghijklmnopqrstuvwxyz"])
+'[a-z]')
a = 'bcabhhjabjjbckjkjabckkjdefghiklmn90'
re.findall(rexp, a)
#Answer: ['bc', 'ab', 'h', 'h', 'j', 'ab', 'j', 'j', 'bc', 'k', 'jk', 'j', 'abc', 'k', 'k', 'j', 'defghi', 'klmn']
Upvotes: 0
Reputation: 2185
Generate all the regex substrings like ^a+b+c+$ (longest to shortest). Then match each of those regexs against all the substrings (longest to shortest) of "abcghiijkyxz" and stop at the first match.
def all_substrings(s):
n = len(s)
for i in xrange(n, 0, -1):
for j in xrange(n - i + 1):
yield s[j:j + i]
def longest_alphabetical_substring(s):
for t in all_substrings("abcdefghijklmnopqrstuvwxyz"):
r = re.compile("^" + "".join(map(lambda x: x + "+", t)) + "$")
for u in all_substrings(s):
if r.match(u):
return u
print longest_alphabetical_substring("abcghiijkyxz")
That prints "ghiijk".
Upvotes: 2
Reputation: 43169
To actually "solve" the problem, you could use
string = 'abcxyzghiijkl'
def sort_longest(string):
stack = []; result = [];
for idx, char in enumerate(string):
c = ord(char)
if idx == 0:
# initialize our stack
stack.append((char, c))
elif idx == len(string) - 1:
result.append(stack)
elif c == stack[-1][1] or c == stack[-1][1] + 1:
# compare it to the item before (a tuple)
stack.append((char, c))
else:
# append the stack to the overall result
# and reinitialize the stack
result.append(stack)
stack = []
stack.append((char, c))
return ["".join(item[0]
for item in sublst)
for sublst in sorted(result, key=len, reverse=True)]
print(sort_longest(string))
Which yields
['ghiijk', 'abc', 'xyz']
in this example.
stack
variable which is filled by your requirements using ord()
.
Upvotes: 0
Reputation: 3405
Regex: char+
meaning a+b+c+...
Details:
+
Matches between one and unlimited timesPython code:
import re
def LNDS(text):
array = []
for y in range(97, 122): # a - z
st = r"%s+" % chr(y)
for x in range(y+1, 123): # b - z
st += r"%s+" % chr(x)
match = re.findall(st, text)
if match:
array.append(max(match, key=len))
else:
break
if array:
array = [max(array, key=len)]
return array
Output:
print(LNDS('abababababab abc')) >>> ['abc']
print(LNDS('abcghiijkyxz')) >>> ['ghiijk']
For string abcghiijkyxz
regex pattern:
a+b+ i+j+k+l+
a+b+c+ j+k+
a+b+c+d+ j+k+l+
b+c+ k+l+
b+c+d+ l+m+
c+d+ m+n+
d+e+ n+o+
e+f+ o+p+
f+g+ p+q+
g+h+ q+r+
g+h+i+ r+s+
g+h+i+j+ s+t+
g+h+i+j+k+ t+u+
g+h+i+j+k+l+ u+v+
h+i+ v+w+
h+i+j+ w+x+
h+i+j+k+ x+y+
h+i+j+k+l+ y+z+
i+j+
i+j+k+
Upvotes: 1
Reputation: 13458
As mentioned, regex is not the best tool for this. Since you are interested in a continuous sequence, you can do this with a single for loop:
def LNDS(s):
start = 0
cur_len = 1
max_len = 1
for i in range(1,len(s)):
if ord(s[i]) in (ord(s[i-1]), ord(s[i-1])+1):
cur_len += 1
else:
if cur_len > max_len:
max_len = cur_len
start = i - cur_len
cur_len = 1
if cur_len > max_len:
max_len = cur_len
start = len(s) - cur_len
return s[start:start+max_len]
>>> LNDS('abcghiijkyxz')
'ghiijk'
We keep a running total of how many non-decreasing characters we have seen, and when the non-decreasing sequence ends we compare it to the longest non-decreasing sequence we saw previously, updating our "best seen so far" if it is longer.
Upvotes: 2