Reputation: 3243
I have the following case, where in my string I have improperly formatted mentions of the form "(19561958)" that I would like to split into "(1956-1958)". The regular expression that I tried is:
import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)
but this returns me "(19561958-)". How can I achieve my purpose? Many thanks!
Upvotes: 1
Views: 58
Reputation: 122032
You could capture the two years separately, and insert the hyphen between the two groups:
>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'
Note that \d\d\d\d
is written more concisely as \d{4}
.
As currently written, this will insert a hyphen between the first two groups of four in any eight-digit-plus number. If you require the parentheses for the match, you can include them explicitly with look-arounds:
>>> re.sub(r'''
(?<=\() # make sure there's an opening parenthesis prior to the groups
(\d{4}) # one group of four digits
(\d{4}) # and a second group of four digits
(?=\)) # with a closing parenthesis after the two groups
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'
Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:
>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
Upvotes: 2
Reputation: 174706
You could use capturing groups or look arounds.
re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
\d{4}
matches exactly 4 digits.
Example:
>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'
OR
Through lookarounds.
>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
(?<=\(\d{4})
Positive lookbehind which asserts that the match must be preceded by (
and four digit characters.
(?=\d{4}\))
Posiitve lookahead which asserts that the match must be followed by 4 digits plus )
symbol.
Here a boundary got matched. Replacing the matched boundary with -
will give you the desired output.
Upvotes: 2
Reputation: 15962
Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)"
or r"(\d{4})(\d{4})"
.
The 2nd group is referenced with \2
.
Upvotes: 2