user3182532

Reputation: 1127

Python regex: lookbehind + lookahead with a character set

I would like to get the string 10M5D8P into a dictionary:

M:10, D:5, P:8 etc. ...

The string could be longer, but it's always a number followed by a single letter from this alphabet: MIDNSHP=X

As a first step I wanted to split the string with a lookbehind and lookahead, in both cases matching this regex: [0-9]+[MIDNSHP=X]

My current attempt, which does not work, looks like this:

import re

re.compile("(?<=[0-9]+[MIDNSHP=X])(?=[0-9]+[MIDNSHP=X])").split("10M5D8P")

It gives me an error message that I do not understand: "look-behind requires fixed-width pattern"

Upvotes: 0

Views: 217

Answers (2)

nneonneo

Reputation: 179402

The message "look-behind requires fixed-width pattern" means exactly what it says: in Python's re engine, a look-behind pattern must match a fixed number of characters. In particular, it cannot contain variable-length quantifiers (?, +, *, or a range such as {1,3}); a fixed repeat count such as {3} is fine. So we should pick a fixed-width piece to use as our lookbehind:

(?<=[MIDNSHP=X])(?=\d)

This uses just the single letter as the lookbehind and a single digit as the lookahead. However, on Python versions before 3.7, splitting with this expression will not work because of Python bug 3262 (re.split does not split on empty matches there). On those versions you need a workaround like this instead:

>>> re.compile(r"(?<=[MIDNSHP=X])(?=\d)").sub('|', '10M5D8P').split("|")
['10M', '5D', '8P']

but this is pretty ugly. A simpler solution is to use findall to extract what you want:

>>> re.findall('([0-9]+)([MIDNSHP=X])', '10M5D8P')
[('10', 'M'), ('5', 'D'), ('8', 'P')]

from which you can pretty easily create a dictionary:

>>> {k:int(v) for v,k in re.findall('([0-9]+)([MIDNSHP=X])', '10M5D8P')}
{'P': 8, 'M': 10, 'D': 5}
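One thing to keep in mind with the dictionary approach is that a plain comprehension silently keeps only the last count if a letter occurs more than once, and findall skips over anything it cannot match. The sketch below wraps the same findall pattern in a small helper; the name parse_ops, the choice to sum repeated letters, and the ValueError on malformed input are my assumptions, not something the question specifies.

```python
import re

# Same token pattern as above: a run of digits followed by one letter.
TOKEN = re.compile(r"([0-9]+)([MIDNSHP=X])")
# The whole string must consist of one or more such tokens.
WHOLE = re.compile(r"(?:[0-9]+[MIDNSHP=X])+")


def parse_ops(s):
    """Return {letter: count} for a string such as '10M5D8P' (hypothetical helper)."""
    # findall silently skips junk, so check the full string first.
    if not WHOLE.fullmatch(s):
        raise ValueError(f"unexpected format: {s!r}")
    result = {}
    for count, letter in TOKEN.findall(s):
        # Assumption: repeated letters are summed instead of overwritten.
        result[letter] = result.get(letter, 0) + int(count)
    return result


print(parse_ops("10M5D8P"))  # {'M': 10, 'D': 5, 'P': 8}
print(parse_ops("10M5D3M"))  # {'M': 13, 'D': 5}
```

If the last value should win instead, replace the summing line with result[letter] = int(count). The printed key order assumes Python 3.7 or later, where dictionaries preserve insertion order.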

Upvotes: 2

Avinash Raj

Reputation: 174696

You may use re.findall.

>>> import re
>>> s = "10M5D8P"
>>> {i[-1]:i[:-1] for i in re.findall(r'[0-9]+[MIDNSHP=X]', s)}
{'M': '10', 'P': '8', 'D': '5'}
>>> {i[-1]:int(i[:-1]) for i in re.findall(r'[0-9]+[MIDNSHP=X]', s)}
{'M': 10, 'P': 8, 'D': 5}

Your regex won't work because the re module doesn't support variable-length lookbehind assertions. Before Python 3.7 it also doesn't support splitting on a zero-width boundary, so splitting on something like (?<=\d)(?=[A-Z]) isn't possible there either.
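To make both limitations concrete, here is a small sketch. It shows the variable-width lookbehind being rejected at compile time, and a fixed-width boundary pattern used for splitting. The helper name split_tokens is hypothetical, and the version check rests on re.split handling empty matches from Python 3.7 onward; on older versions the sketch falls back to a substitute-then-split trick, which assumes the character "|" never appears in the input.

```python
import re
import sys

# Variable-width lookbehind ([0-9]+ has no fixed length): rejected by re.
try:
    re.compile(r"(?<=[0-9]+[MIDNSHP=X])(?=[0-9]+[MIDNSHP=X])")
    compile_failed = False
except re.error as exc:
    compile_failed = True
    print("re.error:", exc)

# Fixed-width alternative: one letter behind, one digit ahead.
BOUNDARY = re.compile(r"(?<=[MIDNSHP=X])(?=\d)")


def split_tokens(s):
    """Split '10M5D8P' into ['10M', '5D', '8P'] (hypothetical helper)."""
    if sys.version_info >= (3, 7):
        # re.split splits on zero-width matches from Python 3.7 on.
        return BOUNDARY.split(s)
    # Older versions: mark each boundary, then split on the marker.
    # Assumption: '|' never occurs in the input.
    return BOUNDARY.sub("|", s).split("|")


print(split_tokens("10M5D8P"))  # ['10M', '5D', '8P']
```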

Upvotes: 2
