Reputation: 39
I cannot for the life of me figure out how to write a regex that replaces each run of a repeated character in a string with the number of repetitions followed by that character.
For example, let's say I have this string in input: "HB???B???B???B???B???B???B???B???"
I wish to get the following pattern in output: "HB3?B3?B3?B3?B3?B3?B3?B3?"
I am asking this question because I am using jinja2 to generate Python files from templates. These Python files use the standard struct module, and I need to autogenerate potentially huge structs based on a spec. I need to unpack everything at once because unpacking values one at a time causes byte-alignment issues on some of the CPU architectures I am using.
Maybe there is a better solution I have not thought of.
Upvotes: 1
Views: 195
Reputation: 12698
Even a simple problem like the one you have posted makes the language definition non-regular. That is, no regular expression in the strict sense can match some text against a subexpression and then require the exact same string it matched before to appear again. That is a context dependence, and as such it cannot be recognized by a regular expression or finite automaton (proofs are available in many places).
But all is not lost. Many regex libraries support capturing groups, so you can create a group and then refer back to it (meaning the same string matched before) later in the same regular expression.
Mathematically, this is not a regular language, and the expression that matches it is not a regular expression either, but it works, as it was implemented in early versions of Unix.
HB(...)\1*
Here, a group of three wildcard characters '.' (each matching any character except a newline) is matched, and then zero or more repetitions of that same matched text (as per the * operator) may follow. This will match things like
HBABCABCABCABCABCABC
or
HBBBABBABBABBABBA
but not the whole of
HBBBABBABBABB (the last repetition of the three-letter sequence BBA is incomplete)
The subexpression in parentheses can be any valid regular expression. Once it is matched, the matched text is saved and substituted for the \1 reference in the rest of the regular expression. You can achieve even more complicated things than that; the only requirement is that a group reference must refer to something already matched earlier in that same regular expression (that is, the reference must come after the closing parenthesis that delimits that group).
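As a rough illustration of the backreference behaviour described above, here is a minimal sketch in Python. The helper name fully_matches is made up for this example, and it uses re.fullmatch so that the whole string has to be consumed; an unanchored search would still find a shorter match inside the third string.

```python
import re

# "HB", then any three characters captured as group 1,
# then zero or more exact repeats of whatever group 1 matched.
pattern = re.compile(r'HB(...)\1*')

def fully_matches(text):
    """Hypothetical helper: True only if the pattern consumes the entire string."""
    return pattern.fullmatch(text) is not None

print(fully_matches("HBABCABCABCABCABCABC"))  # ABC repeated six times
print(fully_matches("HBBBABBABBABBABBA"))     # BBA repeated five times
print(fully_matches("HBBBABBABBABB"))         # trailing "BB" is an incomplete repeat
```

The first two calls should report a full match and the last one should not, because the final BBA is cut short.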
Upvotes: 0
Reputation: 12015
Here is a solution using Python's re.sub:
>>> import re
>>> s = "HB???B???B???B???B???B???B???B???"
>>> re.sub(r'\?+', lambda m: str(len(m.group()))+'?', s)
'HB3?B3?B3?B3?B3?B3?B3?B3?'
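If other format characters can repeat as well, the same idea can be combined with a backreference. The following is only a sketch under a few assumptions: the function name compress_format is hypothetical, the input format string contains no repeat counts yet, and 's' and 'p' are deliberately left alone because in the struct module a count in front of them means a byte length rather than a repetition.

```python
import re
import struct

# Format characters for which struct treats "3x" the same as "xxx".
# 's' and 'p' are excluded on purpose: "3s" is one 3-byte string, not "sss".
REPEATABLE = r'[xcbB?hHiIlLqQnNefdP]'

def compress_format(fmt):
    """Collapse runs of one repeated format character into '<count><char>'.

    Assumes fmt does not already contain repeat counts.
    """
    # Group 1 captures a single character; \1+ requires one or more copies of it.
    return re.sub(
        r'(' + REPEATABLE + r')\1+',
        lambda m: f"{len(m.group())}{m.group(1)}",
        fmt,
    )

original = "HB???B???B???B???B???B???B???B???"
compact = compress_format(original)
print(compact)
# Both forms should describe the same layout, so their sizes should agree.
print(struct.calcsize(original) == struct.calcsize(compact))
```

Single characters such as the H and B above are left untouched, since the pattern only fires on runs of two or more.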
Upvotes: 2