Colonel Beauvel

Reputation: 31171

Splitting by multiple separators and keeping the separator

I have the following string inputs:

"11A4B"
"5S6B"

And want the following outputs:

["11A", "4B"]
["5S", "6B"]

That is, split after each delimiter A, B or S and keep the delimiter.

I can do this with split from re (putting parentheses around the delimiter pattern also returns the delimiter used):

re.split("([ABS])", "11A4B")
#['11', 'A', '4', 'B', '']

I can post-process this to get the wanted result, but I wonder if there is a pure regex solution?

Upvotes: 2

Views: 348

Answers (4)

Wiktor Stribiżew

Reputation: 626758

A solution that works in all Python versions is based on the PyPI regex module, using regex.split with the regex.V1 flag:

import regex
ss = ["11A4B","5S6B"]
delimiters = "ABS"
for s in ss:
    print(regex.split(r'(?<=[{}])(?!$)'.format(regex.escape(delimiters)), s, flags=regex.V1))

Output:

['11A', '4B']
['5S', '6B']

Details

  • (?<=[ABS]) - a positive lookbehind that matches a location that is immediately preceded by A, B or S
  • (?!$) - and that is not immediately followed by the end of the string (so the location at the end of the string is rejected).

regex.escape is used in case the delimiter list contains characters that are special inside a character class, like ^, \, - or ].

Starting with Python 3.7, re.split can also split on zero-length matches, so the following will work, too:

re.split(r'(?<=[{}])(?!$)'.format(re.escape(delimiters)), s)
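As a self-contained sketch of the re.split variant above, assuming Python 3.7 or later (split_keep is a hypothetical helper name used only for illustration):

```python
import re

delimiters = "ABS"

# Zero-width split point: immediately after a delimiter,
# but not at the very end of the string.
pattern = re.compile(r'(?<=[{}])(?!$)'.format(re.escape(delimiters)))

def split_keep(s):
    """Split s after each delimiter, keeping the delimiter (hypothetical helper)."""
    return pattern.split(s)

results = [split_keep(s) for s in ["11A4B", "5S6B"]]
print(results)
```

Compiling the pattern once avoids rebuilding it for every input string.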

Otherwise, you may use workarounds such as:

re.findall(r'[^ABS]*[ABS]?', s) # May result in empty items, too
re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', s) # no empty items due to the lookahead requiring at least 1 char

See the regex demo.

Details

  • (?s) - . matches newlines, too
  • (?=.) - one char should appear immediately to the right of the current location
  • [^ABS]* - any 0+ chars other than A, B and S
  • [ABS]? - 1 or 0 (=optional) A, B or S char.
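To compare the two findall workarounds side by side, here is a minimal sketch (the sample strings are just the ones from the question plus one with trailing text; the variable names are illustrative):

```python
import re

s = "11A4B"

# First variant: every part is optional, so the pattern can also match
# the empty string, which may leave empty items in the result.
with_empty = re.findall(r'[^ABS]*[ABS]?', s)

# Second variant: the (?=.) lookahead requires at least one remaining
# character, so no empty items are produced.
no_empty = re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', s)

# Filtering the first result gives the same list as the second.
filtered = [item for item in with_empty if item]

# Trailing text after the last delimiter is kept as its own item.
trailing = re.findall(r'(?s)(?=.)[^ABS]*[ABS]?', "11A4B5")

print(with_empty, no_empty, filtered, trailing)
```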

Upvotes: 4

Jan

Reputation: 43169

You could use lookarounds:

(?<=[ABS])(?!$)

See a demo on regex101.com.

Upvotes: 3

CertainPerformance

Reputation: 370699

Use re.findall instead, and match digits followed by either A, B, or S:

re.findall(r'\d+[ABS]', '11A4B')

Output:

['11A', '4B']

If the input might have other alphabetical characters as well, then use a negated character set instead:

re.findall(r'[^ABS]+[ABS]', 'ZZZAYYYSXXXB')

Output:

['ZZZA', 'YYYS', 'XXXB']
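One thing to keep in mind: both patterns require every chunk to end in a delimiter, so text after the last delimiter is dropped. The sketch below illustrates this with a hypothetical input that has such trailing text, and shows one possible adjustment (making the delimiter optional), which is an assumption about what you might want rather than part of the question:

```python
import re

s = "11A4B5"  # hypothetical input with trailing text after the last delimiter

# Delimiter required: the trailing '5' is not matched.
strict = re.findall(r'[^ABS]+[ABS]', s)

# Delimiter optional: the trailing '5' is kept as its own item.
lenient = re.findall(r'[^ABS]+[ABS]?', s)

print(strict)
print(lenient)
```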

Upvotes: 3

Daniel

Reputation: 42748

Use findall:

re.findall(r'(.*?(?:[ABS]|.$))', "11A4B5")
# ['11A', '4B', '5']

Upvotes: 1
