Reputation: 7619
I have an input string like this: a1b2c30d40, and I want to tokenize the string to: a, 1, b, 2, c, 30, d, 40.
I know I can read each character one by one and keep track of the previous character to decide whether to start a new token (two digits in a row belong to the same token), but is there a more Pythonic way of doing this?
Upvotes: 7
Views: 1029
Reputation: 129754
>>> import re
>>> re.split(r'(\d+)', 'a1b2c30d40')
['a', '1', 'b', '2', 'c', '30', 'd', '40', '']
On the pattern: as the comment says, \d means "match one digit" and + is a modifier that means "match one or more", so \d+ means "match as many digits as possible". This is put into a group (), so the entire pattern in the context of re.split means "split this string using as many digits as possible as the separator, additionally capturing the matched separators into the result". If you omit the group, you get ['a', 'b', 'c', 'd', '']. The trailing '' is there because the string ends with a separator, so re.split reports the empty piece after it.
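To get exactly the token list from the question, the empty string has to be dropped. Here is a minimal sketch of two ways to do that; the function names tokenize_split and tokenize_findall are made up for illustration, and the second one swaps in re.findall instead of re.split, assuming the input contains only digits and non-digits to be grouped into runs.

```python
import re


def tokenize_split(s):
    """Split on runs of digits, keep the digits, and drop empty pieces."""
    # The capturing group makes re.split return the separators too.
    # Filtering on truthiness removes the '' produced when the string
    # starts or ends with a run of digits.
    return [piece for piece in re.split(r'(\d+)', s) if piece]


def tokenize_findall(s):
    """Alternative: match the tokens directly instead of splitting."""
    # \d+ matches a run of digits, \D+ matches a run of non-digits,
    # so every character ends up in exactly one token.
    return re.findall(r'\d+|\D+', s)


print(tokenize_split('a1b2c30d40'))
print(tokenize_findall('a1b2c30d40'))
```

Note that both versions keep a run of several letters together as one token (for example 'ab' in 'ab12'); if every letter should be its own token, the pattern would need to be r'\d+|\D' instead.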
Upvotes: 13