Reputation: 105

What is the fastest way to restrict a number of repeating characters with different length?

I would like to restrict the number of repeating characters in a string, given that different characters have different limits.

Suppose I have the string Mary,,, had!!!!! a--- little ? lamb........ and a set of characters that are allowed a higher limit, chars = '.!?'. I want all other punctuation marks, such as , and - (suppose I have a list of those), to occur only once in a row, while the characters in chars may occur at most 3 times in a row.

Thus the final string should look like this: Mary, had!!! a- little ? lamb...

Could anyone give me a hint what is the fastest way to do that, please? I suppose I will have to use groupby from itertools, but I can't quite wrap my head around it. Any tips are appreciated! Thank you in advance!

Upvotes: 0

Answers (4)

trincot

Reputation: 350310

Another solution with re.sub that needs no callback function:

import re

only_once = ',-'
only_thrice = '.!?'

regex = f"([{re.escape(only_once)}])\\1+|([{re.escape(only_thrice)}])\\2{{3,}}"

# example
s = 'Mary,,, had!!!!! a--- little ? lamb........'
result = re.sub(regex, r"\1\2\2\2", s)

Explanation

First of all, re.escape is called on the two strings so that every character is treated as a literal character rather than being interpreted as a special character by the regular expression language.

By wrapping the characters in square brackets, we create a literal character class.

By wrapping inside parentheses we create a capture group.

The \1 is a backreference to the first group. This backslash is escaped because Python would otherwise interpret the backslash as an escape character, and the regex wouldn't get to see it. The backreference means that the same character that was captured in the capture group should be matched again. \1+ means it should be matched at least once more, and as many times as possible.

The \2 is a backreference to the second group. {3,} means that the character that was captured in the (second) capture group should be matched at least 3 more times, and as many times as possible. The braces are doubled because otherwise the f-string would interpret them as interpolation instructions, while we want them to be passed on to the regular expression engine. Doubling is how an f-string produces single braces.

The second argument of re.sub is the replacement string, which also can use backreferences. Every match will either represent a repeated sequence of the first type of characters, or a sequence of the second type of characters (exclusive or), each time a repeated sequence that is too long. In either case, exactly one of the capture groups is non-empty. So when we have \1\2\2\2 we actually say: produce \1 when this match is about the first case, or produce \2\2\2 when it is about the second case. So in the first case we want to replace the repeated character chain with just one occurrence of that character, and in the second case we want to replace that chain with three occurrences of that single character. In either case, the result is shorter than what was matched.
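
To see these pieces working together, here is a minimal sketch that builds the pattern, prints it, and applies it. The wrapper name limit_runs is made up for illustration. It relies on the behaviour of Python 3.5 and later, where a backreference to a group that did not take part in the match is replaced by an empty string.

```python
import re

def limit_runs(text, only_once=",-", only_thrice=".!?"):
    # Hypothetical helper wrapping the pattern from this answer.
    regex = (
        f"([{re.escape(only_once)}])\\1+"
        f"|([{re.escape(only_thrice)}])\\2{{3,}}"
    )
    # Only one of the two groups takes part in any given match; the other
    # backreference expands to an empty string (Python 3.5 and later).
    return regex, re.sub(regex, r"\1\2\2\2", text)

pattern, result = limit_runs("Mary,,, had!!!!! a--- little ? lamb........")
print(pattern)
print(result)  # Mary, had!!! a- little ? lamb...
```

Runs that are already within their limit (such as !! or ..) do not match either alternative and are left untouched.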

Upvotes: 0

dawg

Reputation: 103874

Personally, I would just use a simple loop over the restricted characters:

import re
max_counts={**{}.fromkeys(".!?", 3),**{}.fromkeys("-,",1)}
# {'.': 3, '!': 3, '?': 3, '-': 1, ',': 1}

s= 'Mary,,, had!!!!! a--- little ? lamb........'

for c,cnt in max_counts.items():
    s=re.sub(rf'{re.escape(c)}{{{cnt+1},}}',c*cnt, s)

>>> s
'Mary, had!!! a- little ? lamb...'

Better, you can invert the mapping from character=>count to count=>characters by defining your dict like so:

invert={3: '.!?', 1: '-,'}
# invert={3: ['.', '!', '?'], 1: ['-', ',']} works too for the code below

Then, with fewer loop iterations, you can do:

for cnt,chars in invert.items():
    s=re.sub(rf'([{"".join(chars)}])\1{{{cnt},}}', r'\1'*cnt, s)

The expression:

rf'([{"".join(chars)}])\1{{{cnt},}}'

constructs a literal regex of this form:

([.!?])\1{3,}

That regex only matches runs that are longer than the allowed cnt. Since the characters are inside a [ class ], they do not need to be escaped, except for ^-\[]. If you have any of those, do:

for cnt,chars in invert.items():
    s=re.sub(rf'([{"".join(map(re.escape, chars))}])\1{{{cnt},}}', r'\1'*cnt, s)
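
As a rough sketch of how the escaped version could be packaged, the loop can be wrapped in a function. The name cap_runs and its limits argument are assumptions made for this example, not part of the answer.

```python
import re

def cap_runs(text, limits):
    # limits maps a maximum run length to the characters it applies to,
    # e.g. {3: ".!?", 1: "-,"} (hypothetical helper, not from the answer).
    for cnt, chars in limits.items():
        char_class = "".join(map(re.escape, chars))
        # One captured character followed by at least `cnt` more copies
        # means the run is longer than allowed.
        pattern = rf"([{char_class}])\1{{{cnt},}}"
        text = re.sub(pattern, r"\1" * cnt, text)
    return text

print(cap_runs("Mary,,, had!!!!! a--- little ? lamb........", {3: ".!?", 1: "-,"}))
# Mary, had!!! a- little ? lamb...
print(cap_runs("a^^^^ b]]]", {2: "^]"}))
# a^^ b]]
```

The second call uses characters that are special inside a character class, which is the case where the re.escape step matters.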

Upvotes: 0

Alain T.

Reputation: 42143

You could indeed use groupby and set up a dictionary with the number of allowed repetitions for each character that has a restriction:

from itertools import groupby,islice
from collections import Counter

maxRep  = Counter(",-"*1 + ".!?"*3)

output:

S = "Mary,,, had!!!!! a--- little ? lamb........"

S = "".join(c for g,r in groupby(S) for c in islice(r,0,maxRep.get(g)))

print(S)
# Mary, had!!! a- little ? lamb...

groupby will yield groups of repeated characters, with the character in g and an iterator over the repetitions in r.
islice will go through the repetitions (r) from index 0 up to the maximum provided by the maxRep dictionary.
Using maxRep.get(g) will return either the restriction or None if the character is not restricted.
With None, the slice will go all the way to the end of r, thus not restricting the repetitions of characters that are not in maxRep.
"".join(...) puts the characters produced by the generator expression together into the resulting string.
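
To make the two building blocks concrete, here is a small sketch of what groupby and islice each produce on their own (the sample text and variable names are chosen only for this illustration):

```python
from itertools import groupby, islice

text = "had!!!!! a---"

# Each group is (character, iterator over its consecutive repetitions).
runs = [(char, len(list(reps))) for char, reps in groupby(text)]
print(runs)

# islice(iterable, 0, stop) yields at most `stop` items; stop=None means no limit.
print(list(islice("!!!!!", 0, 3)))     # ['!', '!', '!']
print(list(islice("aaaaa", 0, None)))  # ['a', 'a', 'a', 'a', 'a']
```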

Note that this is slower than regular expressions (using the re module). However, if you want to use regular expressions, it will be simpler and faster to clean up by deleting the superfluous characters than by replacing repetitions with their maximum streaks:

import re
pattern = "[{0}]+(?=[{0}]{{{1},{1}}})"    # look ahead for x reps
max1 = pattern.format(r",-",1)            # [,-]+(?=[,-]{1,1})
max3 = pattern.format(r".!?",3)           # [.!?]+(?=[.!?]{3,3})
restrictions = re.compile(max1+"|"+max3)

Note that you will have to use escaping if you want restrictions on characters that need to be escaped within a character class in a regular expression (e.g. a closing square bracket: r"\]")

output:

S = "Mary,,, had!!!!! a--- little ? lamb........"

S = restrictions.sub("",S)

print(S)
# Mary, had!!! a- little ? lamb...
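
One thing to watch for: the character class in the lookahead matches any restricted character of that class, not necessarily the same one, so a mixed run such as ,- is reduced to -. If that is not wanted, a possible variation is to capture one character and look ahead for copies of that same character. This is a sketch rather than the code above; trim_runs and its limits argument are names invented here, and it runs one substitution per class instead of a single combined pattern.

```python
import re

def trim_runs(text, limits):
    # limits maps a string of characters to their maximum run length,
    # e.g. {",-": 1, ".!?": 3} (hypothetical helper for illustration).
    for chars, limit in limits.items():
        char_class = "[" + "".join(re.escape(c) for c in chars) + "]"
        # Delete any character that is still followed by `limit` copies of itself.
        pattern = "(" + char_class + r")(?=\1{" + str(limit) + "})"
        text = re.sub(pattern, "", text)
    return text

limits = {",-": 1, ".!?": 3}
print(trim_runs("Mary,,, had!!!!! a--- little ? lamb........", limits))
# Mary, had!!! a- little ? lamb...
print(trim_runs("wait,- really?!?!", limits))
# wait,- really?!?!
```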

The compiled restrictions pattern is roughly 3x faster than the groupby solution.
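
The exact ratio will vary with the input and the Python version. A rough way to measure it yourself, assuming the definitions from this answer (the function names with_groupby and with_regex, the sample input, and the call count are choices made for this sketch):

```python
import re
import timeit
from collections import Counter
from itertools import groupby, islice

max_rep = Counter(",-" * 1 + ".!?" * 3)
pattern = "[{0}]+(?=[{0}]{{{1},{1}}})"
restrictions = re.compile(pattern.format(",-", 1) + "|" + pattern.format(".!?", 3))

def with_groupby(text):
    return "".join(c for g, r in groupby(text) for c in islice(r, 0, max_rep.get(g)))

def with_regex(text):
    return restrictions.sub("", text)

sample = "Mary,,, had!!!!! a--- little ? lamb........" * 20  # arbitrary test input

# Both approaches should agree before their timings are compared.
assert with_groupby(sample) == with_regex(sample)

for func in (with_groupby, with_regex):
    seconds = timeit.timeit(lambda: func(sample), number=2000)
    print(f"{func.__name__}: {seconds:.3f} s for 2000 calls")
```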

Upvotes: 0

a_guest

Reputation: 36249

You can use re.sub together with a lambda function which handles the replacement logic:

import re

n_max = {**dict.fromkeys('-,', 1), **dict.fromkeys('.!?', 3)}

test_string = 'Mary,,, had!!!!! a--- little ? lamb........'
result = re.sub(
    r'([{chars}])\1+'.format(chars=''.join(re.escape(c) for c in n_max)),
    lambda m: m.group(0)[:n_max[m.group(1)]],
    test_string,
)
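
For readers who find the lambda dense, the same idea can be written with a named callback. This is only a restatement of the code above as a sketch; the names pattern and truncate are introduced here for illustration.

```python
import re

n_max = {**dict.fromkeys("-,", 1), **dict.fromkeys(".!?", 3)}
pattern = re.compile(
    r"([{chars}])\1+".format(chars="".join(re.escape(c) for c in n_max))
)

def truncate(match):
    run = match.group(0)       # the whole run, e.g. "!!!!!"
    char = match.group(1)      # the repeated character, e.g. "!"
    return run[: n_max[char]]  # keep at most the allowed number

print(pattern.sub(truncate, "Mary,,, had!!!!! a--- little ? lamb........"))
# Mary, had!!! a- little ? lamb...
print(pattern.sub(truncate, "fine.. ok!!"))
# fine.. ok!!
```

Runs that are within the limit still match the pattern, but slicing them to the allowed length returns them unchanged.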

Upvotes: 1
