Reputation: 625
@edzech asked how to split a string while keeping the separators in it. The question was marked as a duplicate, but the approach here differs from the one in the "duplicate".
We want to split a string while keeping each delimiter attached to the content it encloses, rather than separated from it.
In brief, for <abc>d<e><f>ghi<j>, we want:
['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>']
instead of:
['<', 'abc', '>', 'd', '<', 'e', '>', '<', 'f', '>', 'ghi', '<', 'j', '>']
Using split does not help, since it splits on the separator itself. We want to keep each separator attached to its content.
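To make the problem concrete, here is a minimal sketch of what a plain split on the brackets produces. The pattern ([<>]) and the variable names are illustrative choices, not taken from the original question.

```python
import re

s = "<abc>d<e><f>ghi<j>"

# A capturing group makes re.split return the separators as well,
# but each bracket comes back as its own list item.
parts = [p for p in re.split(r"([<>])", s) if p]  # drop empty strings
print(parts)
# ['<', 'abc', '>', 'd', '<', 'e', '>', '<', 'f', '>', 'ghi', '<', 'j', '>']
```

The brackets survive, but they are detached from the text they enclose, which is exactly the unwanted result.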
Upvotes: 1
Views: 141
Reputation:
I believe you can use re.split with this regex:
(?<=>)(?=[a-z<])|(?<=[a-z>])(?=<)
https://regex101.com/r/WNy5n9/1
It's nothing more than two alternatives, each a paired lookbehind/lookahead assertion.
Expanded
(?<= > ) # Behind a >
(?= [a-z<] ) # Ahead either a-z or <
| # or,
(?<= [a-z>] ) # Behind either a-z or >
(?= < ) # Ahead a <
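The expanded form can be compiled more or less as written by using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a minimal sketch, assuming Python 3.7 or later (splitting on empty matches needs it); the variable names are illustrative.

```python
import re

# Same pattern as above, written in verbose mode.
rx = re.compile(
    r"""
      (?<= > )        # behind: a >
      (?= [a-z<] )    # ahead: a-z or <
    |                 # or
      (?<= [a-z>] )   # behind: a-z or >
      (?= < )         # ahead: a <
    """,
    re.VERBOSE,
)

s = "<abc>d<e><f>ghi<j>"
parts = rx.split(s)
print(parts)  # ['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>']
```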
Update
Note that in versions of Python prior to 3.7, re.split did not split
on empty (zero-width) matches, so this pattern did not work there.
Presumably the older implementation could not tell an empty match from
no match, or did not know how to bump along after a zero-width match.
This was fixed in version 3.7,
so here you go.
Demo
Version 3.7.3
>>> import sys
>>> print( sys.version )
3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)]
Code
>>> import re
>>> rx = re.compile( r"(?<=>)(?=[a-z<])|(?<=[a-z>])(?=<)" )
>>> s = "<abc>d<e><f>ghi<j>test><g>"
>>> x = re.split( rx, s )
>>> print ( x )
['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test>', '<g>']
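For interpreters older than 3.7, one possible workaround is to collect the zero-width match positions with finditer and slice the string by hand. This is only a sketch: the helper name split_at_zero_width is hypothetical, it assumes every match of the pattern is empty, and it has not been verified on an old interpreter.

```python
import re


def split_at_zero_width(rx, text):
    """Cut text at the start of every match of rx (assumed zero-width)."""
    pieces, start = [], 0
    for m in rx.finditer(text):
        pieces.append(text[start:m.start()])  # text up to the cut point
        start = m.start()
    pieces.append(text[start:])  # remainder after the last cut
    return pieces


rx = re.compile(r"(?<=>)(?=[a-z<])|(?<=[a-z>])(?=<)")
parts = split_at_zero_width(rx, "<abc>d<e><f>ghi<j>test><g>")
print(parts)
# ['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test>', '<g>']
```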
Upvotes: 1
Reputation: 163632
In the proposed solution, a single opening < or closing > that is not part of a <...> pair is excluded from the result.
If you also want to keep a lone < or >, you could use:
<[^<>]*>|(?:(?!<[^<>]*>).)+
Explanation
<[^<>]*> matches an opening <, then zero or more characters other than < or >, then a closing >
| or
(?:(?!<[^<>]*>).)+ is a tempered greedy token: it matches any character as long as what is directly to the right is not the opening-to-closing pattern
For example:
import re
content = "<abc>d<e><f>ghi<j>test><g>"
result = re.findall(r"<[^<>]*>|(?:(?!<[^<>]*>).)+", content)
print(result)
Result
['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test>', '<g>']
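To see the difference, the two findall patterns from this thread can be run side by side on the same input. A small sketch; the variable names simple and tempered are illustrative.

```python
import re

content = "<abc>d<e><f>ghi<j>test><g>"

# Pattern that skips unpaired brackets.
simple = re.findall(r"<.*?>|[^<>]+", content)
# Pattern with the tempered greedy token.
tempered = re.findall(r"<[^<>]*>|(?:(?!<[^<>]*>).)+", content)

print(simple)    # ['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test', '<g>']
print(tempered)  # ['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>', 'test>', '<g>']

# Only the tempered version reassembles into the original string.
print("".join(simple) == content)    # False
print("".join(tempered) == content)  # True
```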
Upvotes: 1
Reputation: 625
Here is the solution.
import re
content = "<abc>d<e><f>ghi<j>"
result = re.findall(r"<.*?>|[^<>]+", content)
print(result)
Output:
['<abc>', 'd', '<e>', '<f>', 'ghi', '<j>']
Explanations:
<.*?> matches everything of the form <content>
[^<>]+ matches everything else
In brief, findall finds everything that matches <content> and, failing that, everything else. That way, the content is split without losing the separators.
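One detail that may matter on messier input: <.*?> lets a match run across a second < before it reaches the closing >, whereas a negated class such as <[^<>]*> does not. The input "<a<b>c" below is a made-up example to show this, not something from the question.

```python
import re

text = "<a<b>c"  # hypothetical input with a stray opening bracket

lazy = re.findall(r"<.*?>|[^<>]+", text)
strict = re.findall(r"<[^<>]*>|[^<>]+", text)

print(lazy)    # ['<a<b>', 'c']
print(strict)  # ['a', '<b>', 'c']
```

For well-formed input such as the string in the question, both patterns give the same result.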
Upvotes: 1