Alijy

Reputation: 165

Splitting a string that includes HTML tags into its building blocks in Python

I have strings such as:

part one<p>part two</p><p>part three <a href="/links/link1">part four</a>part five</p><li>part six <a href="/links/link2">part seven</a>part eight</li>

and I want to generate a Python list like:

['part one','part two','part three','/links/link1','part four','part five','part six','/links/link2','part seven','part eight']

The order of the list should follow the order of occurrence in the string. The strings might contain no tags, more or fewer tags than this example, and nested tags.

I've read answers to some fairly similar questions, but couldn't find one that addresses this specific problem. I've tried packages such as BeautifulSoup, but couldn't extract all the parts in their order of occurrence.

I appreciate any help. Thanks.

Upvotes: 1

Views: 1476

Answers (2)

Ajax1234

Reputation: 71451

You can use BeautifulSoup with recursion:

import bs4

s = 'part one<p>part two</p><p>part three <a href="/links/link1">part four</a>part five</p><li>part six <a href="/links/link2">part seven</a>part eight</li>'

def get_data(d):
    # Text nodes are yielded as they are
    if isinstance(d, bs4.element.NavigableString):
        yield d
    # For links, yield the href before descending into the link text
    if d.name == 'a':
        yield d['href']
    # Recurse into the children in document order, skipping bare newline nodes
    for b in getattr(d, 'contents', []):
        if b != '\n':
            yield from get_data(b)

print(list(get_data(bs4.BeautifulSoup(s, 'html.parser'))))

Output:

['part one', 'part two', 'part three ', '/links/link1', 'part four', 'part five', 'part six ', '/links/link2', 'part seven', 'part eight']
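
Note that this output keeps the trailing spaces ('part three ', 'part six '), while the list in the question has them stripped. If that matters, one possible post-processing step is sketched below. The helper name normalize_parts is made up for illustration and is not part of BeautifulSoup; it assumes you want surrounding whitespace removed and whitespace-only entries dropped.

```python
def normalize_parts(parts):
    """Strip surrounding whitespace and drop empty entries, keeping the order."""
    cleaned = []
    for part in parts:
        # str() also turns bs4 NavigableString objects into plain strings
        text = str(part).strip()
        if text:
            cleaned.append(text)
    return cleaned


raw = ['part one', 'part two', 'part three ', '/links/link1', 'part four',
       'part five', 'part six ', '/links/link2', 'part seven', 'part eight']
print(normalize_parts(raw))
```

In practice you would pass it list(get_data(...)) instead of the hard-coded raw list.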

Upvotes: 0

AKX

Reputation: 168986

You could use the built-in HTML parser class to walk the string and keep track of the bits you need.

from html.parser import HTMLParser


class BuildingBlocksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.bits = []

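    def handle_starttag(self, tag, attrs):
        # Records every attribute value of every tag; in the sample input
        # the only attributes present are the href values of the links.
        for key, value in attrs:
            self.bits.append(value)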

    def handle_data(self, data):
        self.bits.append(data)


parser = BuildingBlocksParser()
parser.feed(
    'part one<p>part two</p><p>part three <a href="/links/link1">part four</a>part five</p><li>part six <a href="/links/link2">part seven</a>part eight</li>'
)
print(parser.bits)

Output:

['part one', 'part two', 'part three ', '/links/link1', 'part four', 'part five', 'part six ', '/links/link2', 'part seven', 'part eight']
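
If the real input carries other attributes (class, title and so on), or the trailing spaces are unwanted, a stricter variant along these lines might be closer to the list in the question. This is only a sketch: LinkAndTextParser and split_html are names invented here, and it assumes that only the href of <a> tags should be kept and that whitespace-only text can be dropped.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects text chunks and <a href> values in document order."""

    def __init__(self):
        super().__init__()
        self.bits = []

    def handle_starttag(self, tag, attrs):
        # Ignore every tag except links
        if tag != 'a':
            return
        for key, value in attrs:
            # Keep only non-empty href values; other attributes are skipped
            if key == 'href' and value:
                self.bits.append(value)

    def handle_data(self, data):
        # Strip surrounding whitespace and skip whitespace-only chunks
        text = data.strip()
        if text:
            self.bits.append(text)


def split_html(s):
    parser = LinkAndTextParser()
    parser.feed(s)
    parser.close()  # flush any text the parser may still be buffering
    return parser.bits


print(split_html('<p class="x">hi <a href="/u" title="t">link</a> there</p>'))
```

Because the parser reports start tags and text in the order they appear, nesting needs no special handling here.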

Upvotes: 2
