Reputation: 105
I would like to restrict the number of repeated characters in a string, given that different characters have different limits.
Suppose, I have a string
Mary,,, had!!!!! a--- little ? lamb........
and a list of characters that are allowed a higher repetition limit: chars = '.!?'. This means that I want all other punctuation signs, like , and - (suppose I have a list of those), to occur only once in a row, while characters from chars can occur at most 3 times in a row.
Thus the final string will be formatted like this:
Mary, had!!! a- little ? lamb...
Could anyone give me a hint on the fastest way to do that, please? I suppose I will have to use groupby from itertools, but I can't quite wrap my head around it. Any tips are appreciated! Thank you in advance!
Upvotes: 0
Views: 88
Reputation: 350310
Another solution with re.sub that works without a callback function:
import re
only_once = ',-'
only_thrice = '.!?'
regex = f"([{re.escape(only_once)}])\\1+|([{re.escape(only_thrice)}])\\2{{3,}}"
# example
s = 'Mary,,, had!!!!! a--- little ? lamb........'
result = re.sub(regex, r"\1\2\2\2", s)
First of all, re.escape is called on the two strings so that no character is interpreted as a special character by the regular expression language instead of being treated as a literal character.
By wrapping the characters in square brackets, we create a literal character class.
By wrapping inside parentheses we create a capture group.
The \1 is a backreference to the first group. In the Python source the backslash is doubled (\\1) because the string is not a raw string: Python would otherwise interpret the backslash as an escape character, and the regex wouldn't get to see it. The backreference means that the same character that was captured by the capture group should be matched again. \1+ means it should be matched at least once more, and as many times as possible.
The \2 is a backreference to the second group. {3,} means that the character that was captured by the (second) capture group should be matched at least 3 more times, and as many times as possible. The braces are doubled in the source because otherwise the f-string would interpret them as interpolation instructions, while we want them to be passed on to the regular expression engine. Doubling is how an f-string produces a single literal brace.
The second argument of re.sub is the replacement string, which can also use backreferences. Every match is either an overlong run of a character of the first type or an overlong run of a character of the second type, never both. So exactly one of the capture groups takes part in the match, and (since Python 3.5) a backreference to the group that did not take part is replaced with an empty string. So when we write \1\2\2\2 we actually say: produce \1 when this match is about the first case, or produce \2\2\2 when it is about the second case. In the first case the repeated character chain is replaced with just one occurrence of that character, and in the second case with three occurrences of that character. In either case, the result is shorter than what was matched.
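To check the behaviour described above, the substitution can be wrapped in a small function and tried on a few inputs. This is only a sketch: the helper name limit_repeats is made up for illustration, and it assumes Python 3.5 or later, where a backreference to a group that did not take part in the match is replaced with an empty string.

```python
import re

only_once = ',-'
only_thrice = '.!?'
regex = f"([{re.escape(only_once)}])\\1+|([{re.escape(only_thrice)}])\\2{{3,}}"


def limit_repeats(text):
    # Hypothetical helper: group 1 is emitted once, group 2 three times;
    # whichever group did not match contributes nothing.
    return re.sub(regex, r"\1\2\2\2", text)


print(limit_repeats('Mary,,, had!!!!! a--- little ? lamb........'))
# Mary, had!!! a- little ? lamb...
print(limit_repeats('ok!!!'))   # ok!!!  (already within the limit, no match)
print(limit_repeats('a,-b'))    # a,-b   (different characters are not a run)
```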
Upvotes: 0
Reputation: 103874
Personally, I would just use a simple loop over the limited characters:
import re
max_counts={**{}.fromkeys(".!?", 3),**{}.fromkeys("-,",1)}
# {'.': 3, '!': 3, '?': 3, '-': 1, ',': 1}
s= 'Mary,,, had!!!!! a--- little ? lamb........'
for c, cnt in max_counts.items():
    s = re.sub(rf'{re.escape(c)}{{{cnt+1},}}', c*cnt, s)
>>> s
'Mary, had!!! a- little ? lamb...'
Better, you can invert the mapping from character=>count to count=>characters by defining your dict like so:
invert={3: '.!?', 1: '-,'}
# invert={3: ['.', '!', '?'], 1: ['-', ',']} works too for the code below
Then, with fewer loop iterations, you can do:
for cnt, chars in invert.items():
    s = re.sub(rf'([{"".join(chars)}])\1{{{cnt},}}', r'\1'*cnt, s)
The expression:
rf'([{"".join(chars)}])\1{{{cnt},}}'
constructs a literal regex of this form:
([.!?])\1{3,}
That regex only matches runs that are longer than the allowed cnt. Since the special characters are inside a [ class ], they do not need to be escaped, other than ^-\[]. If you have any of those, do:
for cnt, chars in invert.items():
    s = re.sub(rf'([{"".join(map(re.escape, chars))}])\1{{{cnt},}}', r'\1'*cnt, s)
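The escaped variant can be packaged as a reusable function. The following is a sketch, not part of the original answer: the name limit_runs and its signature are assumptions, and the replacement is built with a function instead of r'\1'*cnt so that the replacement text is always taken literally.

```python
import re


def limit_runs(text, limits):
    """Trim runs of repeated characters.

    `limits` maps a maximum run length to the characters it applies to,
    e.g. {3: '.!?', 1: '-,'} (hypothetical helper, for illustration only).
    """
    for cnt, chars in limits.items():
        # Escape every character so that ^, -, \ and ] are safe in the class
        char_class = "".join(map(re.escape, chars))
        # One captured character followed by at least `cnt` more copies
        pattern = rf'([{char_class}])\1{{{cnt},}}'
        text = re.sub(pattern, lambda m, n=cnt: m.group(1) * n, text)
    return text


print(limit_runs('Mary,,, had!!!!! a--- little ? lamb........', {3: '.!?', 1: '-,'}))
# Mary, had!!! a- little ? lamb...
print(limit_runs('a^^^b]]', {1: '^]'}))
# a^b]
```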
Upvotes: 0
Reputation: 42143
You could indeed use groupby and set up a dictionary with the number of allowed repetitions for the characters that have a restriction:
from itertools import groupby,islice
from collections import Counter
maxRep = Counter(",-"*1 + ".!?"*3)
output:
S = "Mary,,, had!!!!! a--- little ? lamb........"
S = "".join(c for g,r in groupby(S) for c in islice(r,0,maxRep.get(g)))
print(S)
# Mary, had!!! a- little ? lamb...
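To see what drives the one-liner, it can help to look at the two building blocks separately. The snippet below is an illustrative sketch on a shorter, made-up input: it lists the runs that groupby produces and shows how islice behaves with a number and with None as the stop value.

```python
from itertools import groupby, islice

# Each group is (character, iterator over its consecutive repetitions)
runs = [(char, len(list(group))) for char, group in groupby("a--- ?!!")]
print(runs)
# [('a', 1), ('-', 3), (' ', 1), ('?', 1), ('!', 2)]

# A numeric stop truncates the run; None means "no limit"
print(list(islice("----", 0, 1)))     # ['-']
print(list(islice("----", 0, None)))  # ['-', '-', '-', '-']
```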
groupby will yield groups of repeated characters, with the character in g and an iterator over the repetitions in r.
islice will go through the repetitions (r) from index 0 up to the maximum provided by the maxRep dictionary.
maxRep.get(g) will return either the restriction or None if the character is not restricted.
With None as the stop value, the slice will go all the way to the end of r, thus not restricting the repetitions of characters that are not in maxRep.
"".join(...) puts the characters produced by the generator expression together into the resulting string.
Note that this is slower than regular expressions (using the re module). However, if you want to use regular expressions, it will be simpler and faster to perform the clean-up by deleting superfluous characters than by replacing repetitions with their maximum allowed streaks:
import re
pattern = "[{0}]+(?=[{0}]{{{1},{1}}})" # look ahead for x reps
max1 = pattern.format(r",-",1) # [,-]+(?=[,-]{1,1})
max3 = pattern.format(r".!?",3) # [.!?]+(?=[.!?]{3,3})
restrictions = re.compile(max1+"|"+max3)
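Note that you will have to use escaping if you want restrictions on characters that need to be escaped within a character class in a regular expression (e.g. a closing square bracket: r"\]"). Also note that this pattern counts any mix of characters from the same class as one run, so an input such as ,- is trimmed to - even though neither character is repeated.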
output:
S = "Mary,,, had!!!!! a--- little ? lamb........"
S = restrictions.sub("",S)
print(S)
# Mary, had!!! a- little ? lamb...
This is roughly 3x faster than the groupby solution.
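If runs that mix characters from the same class must be left alone, one possible variant uses a backreference inside the lookahead so that a character is deleted only when it is followed by enough copies of itself. This is a sketch rather than the answer's original pattern: the helper name same_char_excess is made up, the characters are assumed to be safe inside a character class, and the timing above was not measured for this variant.

```python
import re


def same_char_excess(chars, limit, group):
    # Matches ONE character that is still followed by `limit` copies of
    # itself; `group` is the index this capture group will have in the
    # combined pattern (hypothetical helper, for illustration only).
    return "([" + chars + "])(?=\\" + str(group) + "{" + str(limit) + "})"


restrictions = re.compile(
    same_char_excess(",-", 1, 1) + "|" + same_char_excess(".!?", 3, 2)
)

S = "Mary,,, had!!!!! a--- little ? lamb........"
print(restrictions.sub("", S))
# Mary, had!!! a- little ? lamb...
print(restrictions.sub("", "one,-two"))
# one,-two
```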
Upvotes: 0
Reputation: 36249
You can use re.sub together with a lambda function which handles the replacement logic:
import re
n_max = {**dict.fromkeys('-,', 1), **dict.fromkeys('.!?', 3)}
test_string = 'Mary,,, had!!!!! a--- little ? lamb........'
result = re.sub(
    r'([{chars}])\1+'.format(chars=''.join(re.escape(c) for c in n_max)),
    lambda m: m.group(0)[:n_max[m.group(1)]],
    test_string,
)
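The same idea can be spelled out with a named callback, which makes it easier to see what the match object carries. This is a sketch under the same assumptions as the code above; the names pattern and truncate are introduced here for illustration only.

```python
import re

n_max = {**dict.fromkeys('-,', 1), **dict.fromkeys('.!?', 3)}
pattern = re.compile('([{}])\\1+'.format(''.join(re.escape(c) for c in n_max)))


def truncate(match):
    run = match.group(0)    # the whole run, e.g. '!!!!!'
    char = match.group(1)   # the repeated character, e.g. '!'
    return run[:n_max[char]]


test_string = 'Mary,,, had!!!!! a--- little ? lamb........'
print([m.group(0) for m in pattern.finditer(test_string)])
# [',,,', '!!!!!', '---', '........']
print(pattern.sub(truncate, test_string))
# Mary, had!!! a- little ? lamb...
```

Runs that are within the limit, such as '!!', still match the pattern, but the slice returns them unchanged.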
Upvotes: 1