Reputation: 5411
I'm processing strings like this: "125A12C15"
I need to split them at boundaries between letters and numbers, e.g. this one should become ["125","A","12","C","15"].
Is there a more elegant way to do this in Python than going through it position by position and checking whether it's a letter or a number, and then concatenating accordingly? E.g. a built-in function or module for this kind of thing?
Thanks for any pointers!
Upvotes: 10
Views: 15469
Reputation: 80346
Use itertools.groupby together with the str.isalpha method:
Docstring (itertools.groupby):
groupby(iterable[, keyfunc]) -> create an iterator which returns (key, sub-iterator) grouped by each value of key(value).
Docstring (str.isalpha):
S.isalpha() -> bool
Return True if all characters in S are alphabetic and there is at least one character in S, False otherwise.
In [1]: from itertools import groupby
In [2]: s = "125A12C15"
In [3]: [''.join(g) for _, g in groupby(s, str.isalpha)]
Out[3]: ['125', 'A', '12', 'C', '15']
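The one-liner above can be wrapped in a small helper. This is only a sketch; the name split_alpha_runs is made up for illustration. Note that str.isalpha only separates letters from everything else, so punctuation ends up in the same run as neighbouring digits.

```python
from itertools import groupby


def split_alpha_runs(text):
    """Split text into maximal runs of letters and non-letters.

    split_alpha_runs is a hypothetical helper name, not a library function.
    groupby starts a new group each time str.isalpha flips between
    True and False for consecutive characters.
    """
    return ["".join(run) for _, run in groupby(text, key=str.isalpha)]


print(split_alpha_runs("125A12C15"))  # ['125', 'A', '12', 'C', '15']
print(split_alpha_runs("AB-12"))      # ['AB', '-12'] (the '-' is grouped with the digits)
```

If the input can contain characters other than letters and digits, a different key function (for example str.isdigit) may suit the data better.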
Or possibly re.findall or re.split from the regular expressions module:
In [4]: import re
In [5]: re.findall(r'\d+|\D+', s)
Out[5]: ['125', 'A', '12', 'C', '15']
In [6]: re.split(r'(\d+)', s)  # note: re.split can leave empty strings at the
                               # start/end, which you may have to filter out
Out[6]: ['', '125', 'A', '12', 'C', '15', '']
In [7]: re.split(r'(\D+)', s)  # no empty strings here only because s starts and ends with digits
Out[7]: ['125', 'A', '12', 'C', '15']
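For reuse, the two regex variants might be packaged as below. This is a sketch: the names split_with_findall and split_with_split are made up, and the list comprehension is just one way to drop the empty strings that re.split can leave at the edges.

```python
import re

# Matches a maximal run of digits or a maximal run of non-digits.
_RUNS = re.compile(r"\d+|\D+")


def split_with_findall(text):
    """Hypothetical helper: return alternating digit / non-digit runs."""
    return _RUNS.findall(text)


def split_with_split(text):
    """Hypothetical helper: same result via re.split.

    The capturing group keeps the digit runs in the output; empty
    strings at the start or end are filtered out.
    """
    return [part for part in re.split(r"(\d+)", text) if part]


print(split_with_findall("125A12C15"))  # ['125', 'A', '12', 'C', '15']
print(split_with_split("125A12C15"))    # ['125', 'A', '12', 'C', '15']
```

Both helpers treat anything that is not a digit as one run, so letters and punctuation are not separated from each other.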
As for performance, in this quick benchmark both regex approaches were several times faster than groupby:
In [8]: %timeit re.findall(r'\d+|\D+', s*1000)
100 loops, best of 3: 2.15 ms per loop
In [9]: %timeit [''.join(g) for _, g in groupby(s*1000, str.isalpha)]
100 loops, best of 3: 8.5 ms per loop
In [10]: %timeit re.split(r'(\d+)', s*1000)
1000 loops, best of 3: 1.43 ms per loop
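Outside IPython, a comparable measurement can be sketched with the standard timeit module. The candidate names and the repeat counts below are arbitrary choices for illustration, and absolute timings will depend on the machine and Python version.

```python
import re
import timeit
from itertools import groupby

s = "125A12C15" * 1000

# The three approaches from above; the re.split variant also filters
# out the empty strings so that all candidates return the same list.
candidates = {
    "re.findall": lambda: re.findall(r"\d+|\D+", s),
    "groupby": lambda: ["".join(g) for _, g in groupby(s, str.isalpha)],
    "re.split": lambda: [p for p in re.split(r"(\d+)", s) if p],
}

# Sanity check: every approach should produce the same split.
results = {name: fn() for name, fn in candidates.items()}
assert results["re.findall"] == results["groupby"] == results["re.split"]

for name, fn in candidates.items():
    # Best of 3 repeats, 20 calls each, reported per call.
    best = min(timeit.repeat(fn, number=20, repeat=3)) / 20
    print(f"{name:12s} {best * 1e3:.2f} ms per call")
```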
Upvotes: 32