Reputation: 31171
I have the following string inputs:
"11A4B"
"5S6B"
And want the following outputs:
["11A", "4B"]
["5S", "6B"]
I.e., split after each delimiter A, B or S and keep the delimiter.
I can do this with split from re (wrapping the delimiter pattern in parentheses also returns the delimiters used):
re.split("([ABS])", "11A4B")
#['11', 'A', '4', 'B', '']
I could post-process this to get the desired result, but I wonder whether there is a pure regex solution?
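For reference, the post-processing alluded to here could look roughly like the sketch below. The helper name split_keep_delimiter is hypothetical, and it assumes every delimiter is a single character:

```python
import re

def split_keep_delimiter(s, delimiters="ABS"):
    # Hypothetical helper, not part of the re module.
    # With a capturing group, re.split returns text and delimiters alternately:
    # [text, delim, text, delim, ..., trailing_text]
    parts = re.split("([{}])".format(re.escape(delimiters)), s)
    # Glue each piece of text to the delimiter that follows it.
    pairs = [text + delim for text, delim in zip(parts[0::2], parts[1::2])]
    # Keep any trailing text that has no delimiter after it.
    if parts[-1]:
        pairs.append(parts[-1])
    return pairs

print(split_keep_delimiter("11A4B"))  # ['11A', '4B']
print(split_keep_delimiter("5S6B"))   # ['5S', '6B']
```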
Upvotes: 2
Views: 348
Reputation: 626758
A solution that works in all Python versions is one based on the PyPI regex module, using regex.split with the regex.V1 flag:
import regex
ss = ["11A4B","5S6B"]
delimiters = "ABS"
for s in ss:
    print(regex.split(r'(?<=[{}])(?!$)'.format(regex.escape(delimiters)), s, flags=regex.V1))
['11A', '4B']
['5S', '6B']
Details
(?<=[ABS]) - a positive lookbehind that matches a location that is immediately preceded by A, B or S
(?!$) - and that is not immediately followed by the end of the string (so the match fails at the end of the string).
regex.escape is used just in case there are special regex chars in the delimiter list, like ^, \, - or ].
In Python 3.7 and later, re.split can also split on zero-length matches, so the following will work, too:
re.split(r'(?<=[{}])(?!$)'.format(re.escape(delimiters)), s)
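As a self-contained sketch of the re.split variant (it assumes Python 3.7 or later; the helper name split_after is made up for this illustration):

```python
import re

delimiters = "ABS"
# Zero-width pattern: a position right after a delimiter, but not at the end of the string.
pattern = r'(?<=[{}])(?!$)'.format(re.escape(delimiters))

def split_after(s):
    # Hypothetical helper. Relies on re.split accepting empty matches,
    # which it only does in Python 3.7 and later.
    return re.split(pattern, s)

for s in ["11A4B", "5S6B"]:
    print(split_after(s))
# ['11A', '4B']
# ['5S', '6B']
```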
Otherwise, you may use workarounds such as
re.findall(r'[^ABS]*[ABS]?', s) # May result in empty items, too
re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', s) # no empty items due to the lookahead requiring at least 1 char
See the regex demo.
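To see the difference between the two findall workarounds, here is a small sketch using only the standard re module (the variable names are just for illustration):

```python
import re

s = "11A4B"

# Without the lookahead, the pattern can also match the empty string at the end.
with_empty = re.findall(r'[^ABS]*[ABS]?', s)
print(with_empty)     # ['11A', '4B', '']

# The (?=.) lookahead requires at least one char, so no empty item is returned.
without_empty = re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', s)
print(without_empty)  # ['11A', '4B']
```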
Details
(?s) - . matches newlines, too
(?=.) - one char should appear immediately to the right of the current location
[^ABS]* - any 0+ chars other than A, B and S
[ABS]? - 1 or 0 (=optional) A, B or S char.
Upvotes: 4
Reputation: 370699
Use re.findall instead, and match digits followed by either A, B, or S:
re.findall(r'\d+[ABS]', '11A4B')
Output:
['11A', '4B']
If the input might have other alphabetical characters as well, then use a negated character set instead:
re.findall(r'[^ABS]+[ABS]', 'ZZZAYYYSXXXB')
Output:
['ZZZA', 'YYYS', 'XXXB']
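One caveat, shown as a quick sketch: both patterns above require a delimiter at the end of each item, so trailing text without a delimiter is dropped. If that text should be kept, making the delimiter optional is one possible variant (this is an assumption about what is wanted, since the question's inputs always end in a delimiter):

```python
import re

s = "11A4B7"  # trailing "7" has no delimiter after it

# The delimiter is mandatory here, so the trailing "7" is not returned.
strict = re.findall(r'[^ABS]+[ABS]', s)
print(strict)   # ['11A', '4B']

# With an optional delimiter, the trailing text is kept as its own item.
lenient = re.findall(r'[^ABS]+[ABS]?', s)
print(lenient)  # ['11A', '4B', '7']
```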
Upvotes: 3