Reputation: 1173
I'm trying to split these lines:
<label>Olympic Games</label>
<title>Next stop</title>
Into:
["<label>", "Olympic Games", "</label>"]
["<title>", "Next stop", "</title>"]
In Python I can use regular expressions but what I've made doesn't do anything:
line.split("<\*>")
Upvotes: 4
Views: 2315
Reputation: 44485
If you don't mind punctuation, here is a quick non-regex alternative using itertools.groupby.
Code
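import itertools as it

def split_at(iterable, pred, keep_delimiter=False):
    """Return a list of groups, split where pred(item) is true."""
    if keep_delimiter:
        # Keep every run, including the runs of delimiter items.
        return [list(g) for k, g in it.groupby(iterable, pred)]
    # Keep only the runs between delimiters (pred is false for those items).
    return [list(g) for k, g in it.groupby(iterable, pred) if not k]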
Demo
>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "elit" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,', 'consectetur', '...'],
['elit,'],
['sed', 'do', 'eiusmod', '...']]
>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "consect" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,'],
['consectetur'],
['...', 'elit,', 'sed', 'do', 'eiusmod', '...']]
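To see why this works, it may help to look at what itertools.groupby yields: one (key, group) pair for each run of consecutive items that share the same predicate result. A minimal sketch, where `runs` is a hypothetical helper name that is not part of the answer above:

```python
import itertools as it

def runs(iterable, pred):
    """Return (key, items) pairs, one per consecutive run with the same pred result."""
    return [(k, list(g)) for k, g in it.groupby(iterable, pred)]

words = "Lorem ipsum elit, sed do".split(" ")
result = runs(words, lambda x: "elit" in x)

# Each pair is (predicate result, run of items):
# [(False, ['Lorem', 'ipsum']), (True, ['elit,']), (False, ['sed', 'do'])]
print(result)
```

The runs keyed True are the delimiters, so keeping or dropping them is just a filter on the key.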
Upvotes: 0
Reputation: 23101
Data:
line = """<label>Olympic Games</label>
<title>Next stop</title>"""
With look-ahead / look-behind assertions and re.findall:
import re
pattern = re.compile(r"(<.*(?<=>))(.*)((?=</)[^>]*>)")
print(re.findall(pattern, line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]
Without look-ahead / look-behind assertions, just capturing groups and re.findall:
pattern = re.compile(r"(<[^>]*>)(.*)(</[^>]*>)")
print(re.findall(pattern, line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]
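re.findall returns tuples, while the question asks for lists. A minimal, self-contained sketch of the capturing-group variant that converts each match, assuming one element per line as in the sample data (the name `parts` is only for illustration):

```python
import re

line = """<label>Olympic Games</label>
<title>Next stop</title>"""

# Same capturing-group pattern as above, written as a raw string.
pattern = re.compile(r"(<[^>]*>)(.*)(</[^>]*>)")

# findall returns one tuple per match; convert each to a list
# to get the shape requested in the question.
parts = [list(groups) for groups in pattern.findall(line)]
print(parts)
# [['<label>', 'Olympic Games', '</label>'], ['<title>', 'Next stop', '</title>']]
```

Because `.` does not match a newline by default, each line is matched on its own.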
Upvotes: 2
Reputation: 6220
This regex works for me:
<(label|title)>([^<]*)</(label|title)>
or, as cwallenpoole suggested:
<(label|title)>([^<]*)</(\1)>
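I tested it with http://www.regexpal.com/
I have used three capturing groups. If you don't need them, remove the () around [^<]* and make the alternations non-capturing with (?:label|title); the \1 version needs its first group to stay.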
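In Python, the backreference version could be used roughly like this. This is a sketch under assumptions: each line holds exactly one element, `split_tag` and `TAG_RE` are hypothetical names, and the extra outer groups around the tags are an addition so that the tags themselves are returned.

```python
import re

# Group 1: opening tag, group 2: tag name, group 3: text, group 4: closing tag.
# \2 requires the closing tag name to equal the opening one.
TAG_RE = re.compile(r"(<(label|title)>)([^<]*)(</\2>)")

def split_tag(line):
    """Return [open_tag, text, close_tag], or None if the line does not match."""
    m = TAG_RE.fullmatch(line.strip())
    if m is None:
        return None
    return [m.group(1), m.group(3), m.group(4)]

print(split_tag("<label>Olympic Games</label>"))  # ['<label>', 'Olympic Games', '</label>']
print(split_tag("<label>Olympic Games</title>"))  # None, the tags do not match
```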
What is wrong with your regex <\*> is that it matches only one thing: the literal text <*>. You have escaped the * by writing \*, so what the pattern says is: a <, then a literal *, and then a >.
Upvotes: 3
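A related point worth checking: str.split treats its argument as a literal substring, not as a regular expression, so the call in the question would not split on tags even with a correct pattern. The sketch below illustrates both effects; the string "a<*>b" is made up for illustration.

```python
import re

line = "<label>Olympic Games</label>"

# str.split searches for the literal text <\*>, which is not in the line,
# so the line comes back unsplit.
unsplit = line.split(r"<\*>")
print(unsplit)  # ['<label>Olympic Games</label>']

# As a regex, <\*> matches only the three characters <*>.
literal_match = re.search(r"<\*>", "a<*>b")
print(literal_match.group())  # <*>

# It does not match a real tag such as <label>.
tag_match = re.search(r"<\*>", line)
print(tag_match)  # None
```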
Reputation: 43156
Using lookarounds and a capture group to keep the text after splitting:
re.split(r'(?<=>)(.+?)(?=<)', '<label>Olympic Games</label>')
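Applied to both lines from the question, this might look as follows. It is a sketch that assumes one element per line with non-empty text between the tags; `splitter` is just an illustrative name.

```python
import re

lines = ["<label>Olympic Games</label>", "<title>Next stop</title>"]

# The capture group keeps the text between the tags in the result;
# the lookarounds leave the tags themselves outside the match,
# so they survive as the pieces before and after it.
splitter = re.compile(r"(?<=>)(.+?)(?=<)")

for line in lines:
    print(splitter.split(line))
# ['<label>', 'Olympic Games', '</label>']
# ['<title>', 'Next stop', '</title>']
```

Note that .+? needs at least one character, so an element with no text between its tags would come back unsplit.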
Upvotes: 4