lilgallon

Reputation: 625

How to split a string and keep the separators in it

@edzech asked how to split a string while keeping the separators in it. The question was marked as a duplicate, although the approach here differs from the one in the "duplicate".

We want to split a string while keeping the delimiters attached to their content rather than separated from it. In brief, for <abc>d<e><f>ghi<j>, we want:

['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>']

instead of:

['<', 'abc', '>', 'd', '<', 'e', '>', '<', 'f', '>', 'ghi', '<', 'j', '>']

Using split alone does not help, since it splits on the separator and either drops it or returns it as a separate item. We want to keep it attached to its content.
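
To make the problem concrete, here is a minimal sketch of what a plain split gives. The capturing group and the filtering of empty strings are illustrative choices, not something taken from the linked question.

```python
import re

content = "<abc>d<e><f>ghi<j>"

# A capturing group makes re.split return the separators too,
# but each one comes back as its own item.
parts = [p for p in re.split(r"([<>])", content) if p]  # drop empty strings

print(parts)
```

Every < and > ends up detached from the text it surrounds, which is exactly what we want to avoid.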

Upvotes: 1

Views: 141

Answers (3)

user557597

Reputation:

I believe you can use split with this regex

(?<=>)(?=[a-z<])|(?<=[a-z>])(?=<)

https://regex101.com/r/WNy5n9/1

It's nothing more than two alternatives, each a paired lookbehind/lookahead assertion.

Expanded

   (?<= > )                      # Behind a  >
   (?= [a-z<] )                  # Ahead either a-z or <
|                              # or,
   (?<= [a-z>] )                 # Behind either a-z or >
   (?= < )                       # Ahead a  <

Update
Note that in versions of Python prior to 3.7, re.split did not split on
empty (zero-width) matches, so this pattern does not work there.
Presumably the difficulty was in how to bump along after a zero-width match.

Splitting on empty matches is supported as of version 3.7,
so here you go.

Demo

Version 3.7.3

>>> import sys
>>> print( sys.version )
3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)]

Code

>>> import re
>>> rx = re.compile( r"(?<=>)(?=[a-z<])|(?<=[a-z>])(?=<)" )
>>> s = "<abc>d<e><f>ghi<j>test><g>"
>>> x =  re.split( rx, s )
>>> print ( x )
['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test>', '<g>']
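
If you cannot rely on re.split handling empty matches, the same boundaries can be located with finditer and the string sliced by hand. This is only a sketch: split_at_boundaries is a hypothetical helper name, and I am assuming finditer reports the zero-width matches on the interpreter in question.

```python
import re

rx = re.compile(r"(?<=>)(?=[a-z<])|(?<=[a-z>])(?=<)")

def split_at_boundaries(pattern, text):
    """Cut text at every position where pattern matches (hypothetical helper)."""
    pieces = []
    start = 0
    for m in pattern.finditer(text):
        pos = m.start()
        if pos > start:          # skip a boundary at the current cut position
            pieces.append(text[start:pos])
            start = pos
    pieces.append(text[start:])  # whatever is left after the last boundary
    return pieces

s = "<abc>d<e><f>ghi<j>test><g>"
x = split_at_boundaries(rx, s)
print(x)
```

On 3.7+ this should give the same list as the re.split call above.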

Upvotes: 1

The fourth bird

Reputation: 163632

In the proposed solution (<.*?>|[^<>]+), a single opening < or closing > that is not part of a <...> pair is excluded from the result.

If you also want to keep a lone < or > you could use:

<[^<>]*>|(?:(?!<[^<>]*>).)+

Explanation

  • <[^<>]*> Match an opening <, then 0+ chars other than < or >, then a closing >
  • | Or
  • (?:(?!<[^<>]*>).)+ Tempered greedy token: match any char, 1+ times, as long as the <...> pattern does not start at that position


For example:

import re
content = "<abc>d<e><f>ghi<j>test><g>"
result = re.findall(r"<[^<>]*>|(?:(?!<[^<>]*>).)+", content)
print(result)

Result

['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test>', '<g>']
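
To see the difference between the two patterns, here is a small side-by-side sketch on the same input. The variable names are just for illustration.

```python
import re

content = "<abc>d<e><f>ghi<j>test><g>"

# Pattern from the other answer: a lone > matches neither alternative
simple = re.findall(r"<.*?>|[^<>]+", content)

# Pattern from this answer: the lone > stays attached to the preceding text
tempered = re.findall(r"<[^<>]*>|(?:(?!<[^<>]*>).)+", content)

print(simple)
print(tempered)

# Only the second result can be joined back into the original string
print("".join(simple) == content, "".join(tempered) == content)
```

The first list contains 'test' where the second contains 'test>', so joining the first list back together loses a character.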

Upvotes: 1

lilgallon

Reputation: 625

Here is the solution.

import re

content = "<abc>d<e><f>ghi<j>"
result = re.findall(r"<.*?>|[^<>]+", content)

print(result)

Output:

['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>']

Explanations:

  • regex <.*?> matches a <, then as few characters as possible up to the next >, i.e. <content>
  • regex [^<>]+ matches any run of one or more characters that are neither < nor >

In brief, findall returns every <content> chunk and, between those, every run of other characters. That way, the content is split without losing the separators.
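
The ? after .* matters here. As a quick sketch on a shortened input (chosen only for illustration), compare the lazy and greedy forms:

```python
import re

content = "<abc>d<e>"

# Lazy: stops at the first > after each <
lazy = re.findall(r"<.*?>|[^<>]+", content)

# Greedy: runs from the first < to the last > in the string
greedy = re.findall(r"<.*>|[^<>]+", content)

print(lazy)
print(greedy)
```

With the greedy form the whole string comes back as a single item, so nothing is split.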

Upvotes: 1
