Reputation: 1173

Python split at tag regex

I'm trying to split these lines:

<label>Olympic Games</label>
<title>Next stop</title>

Into:

["<label>", "Olympic Games", "</label>"]
["<title>", "Next stop", "</title>"]

In Python I can use regular expressions but what I've made doesn't do anything:

line.split("<\*>")

Upvotes: 4

Answers (4)

pylang

Reputation: 44485

If you don't mind punctuation, here is a quick non-regex alternative using itertools.groupby.

Code

import itertools as it


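def split_at(iterable, pred, keep_delimiter=False):
    """Return a list of groups, splitting the iterable where pred is true."""
    if keep_delimiter:
        # Keep every run, including the runs of delimiter items.
        return [list(g) for k, g in it.groupby(iterable, pred)]
    # Drop the runs of delimiter items (those for which pred is true).
    return [list(g) for k, g in it.groupby(iterable, pred) if not k]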

Demo

>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "elit" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,', 'consectetur', '...'],
 ['elit,'],
 ['sed', 'do', 'eiusmod', '...']]

>>> words = "Lorem ipsum ..., consectetur ... elit, sed do eiusmod ...".split(" ")
>>> pred = lambda x: "consect" in x
>>> split_at(words, pred, True)
[['Lorem', 'ipsum', '...,'],
 ['consectetur'],
 ['...', 'elit,', 'sed', 'do', 'eiusmod', '...']]

Upvotes: 0

Sandipan Dey

Reputation: 23101

Data:

line = """<label>Olympic Games</label>
<title>Next stop</title>"""

Using look-ahead / look-behind assertions with re.findall:

import re

pattern = re.compile(r"(<.*(?<=>))(.*)((?=</)[^>]*>)")
print(re.findall(pattern, line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]

Without look-ahead / look-behind assertions, just by capturing groups, with re.findall:

pattern = re.compile(r"(<[^>]*>)(.*)(</[^>]*>)")
print(re.findall(pattern, line))
# [('<label>', 'Olympic Games', '</label>'), ('<title>', 'Next stop', '</title>')]

Upvotes: 2

Alejandro Alcalde

Reputation: 6220

This regex works for me:

<(label|title)>([^<]*)</(label|title)>

or, as cwallenpoole suggested:

<(label|title)>([^<]*)</(\1)>

I've used http://www.regexpal.com/

I have used three capturing groups; if you don't need them, simply remove the parentheses.
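
As a rough sketch of how the backreference pattern might be used from Python (Python 3 assumed; the names text, tag_pattern and rows are chosen here for illustration and are not part of the original answer):

```python
import re

# Sample input taken from the question.
text = """<label>Olympic Games</label>
<title>Next stop</title>"""

# \1 requires the closing tag name to repeat the opening tag name.
tag_pattern = re.compile(r"<(label|title)>([^<]*)</\1>")

rows = []
for m in tag_pattern.finditer(text):
    name, content = m.group(1), m.group(2)
    # Rebuild the three pieces the question asks for.
    rows.append([f"<{name}>", content, f"</{name}>"])

for row in rows:
    print(row)
```

Rebuilding the tags from the captured name is only one option; wrapping the whole opening and closing tags in their own groups would work as well.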

What is wrong with your regex <\*> is that it matches only one thing: the literal text <*>. You have escaped * using \*, so what you are saying is:

Match any text with <, then a * and then a >.
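
To make this concrete, here is a small sketch (Python 3 assumed; variable names are illustrative). It also shows a second reason the original attempt does nothing: str.split treats its argument as a literal separator, not as a regular expression.

```python
import re

line = "<label>Olympic Games</label>"

# str.split looks for the literal text <\*>, which never occurs in the line,
# so the line comes back unsplit.
literal_split = line.split(r"<\*>")

# As a regex, <\*> matches only the exact text "<*>".
escaped_match = re.search(r"<\*>", line)
star_match = re.search(r"<\*>", "a<*>b")

# A pattern for "anything between < and >"; the capture group makes
# re.split keep the tags. Empty strings at the ends are filtered out.
parts = [p for p in re.split(r"(<[^>]*>)", line) if p]

print(literal_split)
print(escaped_match, star_match.group())
print(parts)
```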

Upvotes: 3

Aran-Fey

Reputation: 43156

Using lookarounds and a capture group to keep the text after splitting:

re.split(r'(?<=>)(.+?)(?=<)', '<label>Olympic Games</label>')
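
A short sketch applying the same pattern to both lines from the question (Python 3 assumed; lines, split_pattern and result are illustrative names):

```python
import re

lines = ["<label>Olympic Games</label>", "<title>Next stop</title>"]

# Split on the text between ">" and "<"; the capture group keeps that text
# in the result, and the lookarounds leave the tags themselves untouched.
split_pattern = re.compile(r"(?<=>)(.+?)(?=<)")

result = [split_pattern.split(line) for line in lines]
for parts in result:
    print(parts)
```

Because .+? needs at least one character, an element with no text, such as <a></a>, would come back unsplit.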

Upvotes: 4
