Crista23

Reputation: 3243

Python regular expression for splitting two years that have been run together

My strings contain improperly formatted year ranges of the form "(19561958)" that I would like to rewrite as "(1956-1958)". The regular expression I tried is:

import re
a = "(19561958)"
re.sub(r"(\d\d\d\d\d\d\d\d)", r"\1-", a)

but this returns "(19561958-)". How can I get the result I want? Many thanks!

Upvotes: 1

Views: 58

Answers (3)

jonrsharpe

Reputation: 122032

You could capture the two years separately, and insert the hyphen between the two groups:

>>> import re
>>> re.sub(r'(\d{4})(\d{4})', r'\1-\2', '(19561958)')
'(1956-1958)'

Note that \d\d\d\d is written more concisely as \d{4}.


As currently written, this will insert a hyphen between the first two groups of four in any run of eight or more digits. If you require the parentheses for the match, you can include them explicitly with lookarounds:

>>> re.sub(r'''
    (?<=\() # make sure there's an opening parenthesis prior to the groups
    (\d{4}) # one group of four digits
    (\d{4}) # and a second group of four digits
    (?=\))  # with a closing parenthesis after the two groups 
''', r'\1-\2', '(19561958)', flags=re.VERBOSE)
'(1956-1958)'

Alternatively, you could use word boundaries, which would also deal with e.g. spaces around an eight-digit number:

>>> re.sub(r'\b(\d{4})(\d{4})\b', r'\1-\2', '(19561958)')
'(1956-1958)'
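
To see where the three patterns above disagree, here is a rough side-by-side comparison. The sample strings and the helper name `hyphenate` are my own for illustration and are not from the question:

```python
import re

# The three patterns from this answer, compiled once for reuse.
PLAIN = re.compile(r'(\d{4})(\d{4})')
PARENS = re.compile(r'(?<=\()(\d{4})(\d{4})(?=\))')
BOUNDED = re.compile(r'\b(\d{4})(\d{4})\b')

def hyphenate(pattern, text):
    """Insert a hyphen between the two captured four-digit groups."""
    return pattern.sub(r'\1-\2', text)

# Hypothetical sample inputs, chosen to show where the patterns differ.
samples = ['(19561958)', 'born 19561958', 'id 1234567890']

for text in samples:
    print(repr(text))
    for name, pattern in [('plain', PLAIN), ('parens', PARENS), ('bounded', BOUNDED)]:
        print('   ', name, '->', repr(hyphenate(pattern, text)))
```

Only the plain pattern touches the ten-digit number, and only the parenthesised pattern leaves the bare "born 19561958" alone, so which one you want depends on how strict the match needs to be.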

Upvotes: 2

Avinash Raj

Reputation: 174706

You could use capturing groups or lookarounds.

re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)

\d{4} matches exactly 4 digits.

Example:

>>> a = "(19561958)"
>>> re.sub(r"\((\d{4})(\d{4})\)", r"(\1-\2)", a)
'(1956-1958)'

Or, using lookarounds:

>>> a = "(19561958)"
>>> re.sub(r"(?<=\(\d{4})(?=\d{4}\))", r"-", a)
'(1956-1958)'
  • (?<=\(\d{4}) is a positive lookbehind which asserts that the match must be preceded by ( and four digits.

  • (?=\d{4}\)) is a positive lookahead which asserts that the match must be followed by four digits and a ) symbol.

  • Together they match the zero-width position between the two years without consuming any characters. Replacing that position with - gives the desired output.
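
If it helps to convince yourself that the lookaround version really matches an empty position, a small check along these lines should show it (the variable names are just for illustration):

```python
import re

a = "(19561958)"
pattern = re.compile(r"(?<=\(\d{4})(?=\d{4}\))")

# The match consumes no characters: its start and end index are the same.
m = pattern.search(a)
span = m.span()
matched_text = m.group()

print(span)                 # start and end of the zero-width match
print(repr(matched_text))   # the empty string
print(pattern.sub("-", a))  # hyphen inserted at that position
```

Because nothing is consumed, the replacement only inserts the hyphen and leaves the parentheses and digits untouched.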

Upvotes: 2

aneroid

Reputation: 15962

Use two capturing groups: r"(\d\d\d\d)(\d\d\d\d)" or r"(\d{4})(\d{4})".

The 2nd group is referenced with \2.
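
Applied to the string from the question, a minimal sketch might look like the following. The \g<N> spelling is an alternative I'm adding here; it is mainly useful when the group reference is immediately followed by a literal digit in the replacement:

```python
import re

a = "(19561958)"

# Two capturing groups; \1 and \2 refer back to them in the replacement.
result = re.sub(r"(\d{4})(\d{4})", r"\1-\2", a)

# Equivalent replacement using the explicit \g<N> form.
result_g = re.sub(r"(\d{4})(\d{4})", r"\g<1>-\g<2>", a)

print(result)
print(result_g)
```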

Upvotes: 2
