Artyom Vancyan

Reputation: 5388

How to find all possible uniform substrings of a string?

I have a string like

aaabbbbcca

And I'd like to parse all possible uniform substrings from that. So my expected substrings for this string are

['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']

I tried the following

import re

print(re.findall(r"([a-z])(?=\1*)", "aaabbbbcca"))
# Output: ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'a']

Is this possible through regular expressions? If so, how?

Upvotes: 12

Views: 1041

Answers (4)

timgeb

Reputation: 78790

You can use a regex to find streaks of the same character, and then some Python on top to build the smaller streaks.

import re

s = 'aaabbbbcca'
matches = (m.group() for m in re.finditer(r'([a-z])\1*', s))
result = [m[:i] for m in matches for i in range(1, len(m) + 1)]

There's also an itertools solution.

from itertools import groupby
s = 'aaabbbbcca'
matches = (''.join(g) for _, g in groupby(s))
result = [m[:i] for m in matches for i in range(1, len(m) + 1)]
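
To make the two steps visible, here is a small sketch that prints the intermediate streaks before expanding them. The pattern `(.)\1*` is my assumption, swapped in for `([a-z])\1*` so that any character except a newline is accepted; the names `runs` and `result` are just for illustration.

```python
import re

s = 'aaabbbbcca'

# Step 1: find the maximal streaks of one repeated character.
runs = [m.group() for m in re.finditer(r'(.)\1*', s)]
print(runs)  # ['aaa', 'bbbb', 'cc', 'a']

# Step 2: expand each streak into its prefixes, shortest first.
result = [run[:i] for run in runs for i in range(1, len(run) + 1)]
print(result)
```

The regex only locates the streaks; the shorter substrings come from plain slicing, which is why the result comes out in the order the question asks for.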

Upvotes: 7

Pychopath

Reputation: 1580

Using two itertools functions:

from itertools import groupby, accumulate

s = 'aaabbbbcca'

print([a for _, g in groupby(s) for a in accumulate(g)])

Or just with basics:

s = 'aaabbbbcca'

a = ''
print([a := a * (c in a) + c for c in s])

Output for both:

['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']
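
The walrus one-liner can be hard to read at first, so here is an unrolled sketch of what I understand it to do. It relies on `c in a` being a bool that acts as 1 or 0 when multiplied with a string: the current run is kept when `c` continues it and reset to the empty string otherwise. The function name `uniform_prefixes` is hypothetical, not part of the original answer.

```python
def uniform_prefixes(s):
    """Return the growing prefix of each run of equal characters in s."""
    result = []
    run = ''  # the run built so far; empty before the first character
    for c in s:
        if c in run:
            run = run + c  # same character: extend the current run
        else:
            run = c        # different character (or first one): start over
        result.append(run)
    return result


print(uniform_prefixes('aaabbbbcca'))
```

Because a run only ever contains one distinct character, `c in run` is equivalent to comparing `c` with the previous character.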

Upvotes: 3

Wiktor Stribiżew

Reputation: 627507

You can achieve what you need without a regex here:

result = []
text = "aaabbbbcca"
prev = ''
for c in text:
  if c == prev:
    result.append(result[-1] + c)
  else:
    result.append(c)
    prev = c
 
print(result)
# => ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'bbbb', 'c', 'cc', 'a']


In short, iterate over the string and append one new item to the result list per character: if the character differs from the previous one, append the character itself; otherwise, append the previous item with the same character concatenated to it.

With regex, the best you can do is

import re
text = "aaabbbbcca"
print([x.group(1) for x in re.finditer(r'(?=((.)\2*))', text)])
# => ['aaa', 'aa', 'a', 'bbbb', 'bbb', 'bb', 'b', 'cc', 'c', 'a']

Here, (?=((.)\2*)) matches any location inside the string that is immediately followed by any one char (other than line break chars if you do not use the re.DOTALL option) and then zero or more occurrences of the same char. The whole run is captured into Group 1, and the single char into Group 2.
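
As a rough illustration of why this yields one item per character, the sketch below prints where each match starts and what it captures. The shorter input "aabc" and the name `matches` are my own choices for the example.

```python
import re

text = "aabc"

# Each match is zero-width (group 0 is empty); only the lookahead captures text.
matches = [(m.start(), m.group(0), m.group(1))
           for m in re.finditer(r'(?=((.)\2*))', text)]

for start, whole, run in matches:
    print(start, repr(whole), run)
```

Since the lookahead consumes nothing, the regex engine retries at every position, and at each one it captures the rest of the current run. That is why the longest substring of each run comes first.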

Upvotes: 7

JANO

Reputation: 3076

I think this particular problem can be solved with a regex. The answer is based on this answer, where parts of numbers are extracted. The explanation is the same as in the other answer. Each match is itself empty (group 0), because the lookahead consumes no characters, and the capturing group inside the lookahead holds a sequence of a, b or c of length at least 1. Afterward, we simply create a list of the strings captured in group 1.

import re

s = "aaabbbbcca"
matches = re.finditer(r'(?=(a{1,}|b{1,}|c{1,}))', s)
results = [match.group(1) for match in matches]
print(results)

Output:

['aaa', 'aa', 'a', 'bbbb', 'bbb', 'bb', 'b', 'cc', 'c', 'a']

The values in the output are the same as requested, but not in exactly the same order.
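
If the requested order matters, one possible fix (a sketch outside the regex itself, not the only way) is to reverse the matches within each run. It assumes that neighbouring runs never share a character, which holds by construction, so grouping consecutive matches by their first character recovers the runs. The name `ordered` is just for illustration.

```python
import re
from itertools import groupby

s = "aaabbbbcca"
results = [m.group(1) for m in re.finditer(r'(?=(a{1,}|b{1,}|c{1,}))', s)]

# Consecutive matches that start with the same character belong to one run;
# within a run they come out longest-first, so reverse each group.
ordered = [sub
           for _, group in groupby(results, key=lambda r: r[0])
           for sub in reversed(list(group))]
print(ordered)
```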

Upvotes: 1
