Reputation: 165
I have strings such as:
part one<p>part two</p><p>part three <a href="/links/link1">part four</a>part five</p><li>part six <a href="/links/link2">part seven</a>part eight</li>
and I want to generate a Python list like:
['part one','part two','part three','/links/link1','part four','part five','part six','/links/link2','part seven','part eight']
The order in the list should follow the order of occurrence in the string. The strings may contain no tags, more or fewer tags than this example, and nested tags.
I've read answers to some fairly similar questions, but couldn't find one that addresses this specific problem. I've tried packages like BeautifulSoup and such, but couldn't extract all parts and in the order of occurrence.
I appreciate any help. Thanks.
Upvotes: 1
Views: 1476
Reputation: 71451
You can use BeautifulSoup with recursion:
import bs4

s = 'part one<p>part two</p><p>part three <a href="/links/link1">part four</a>part five</p><li>part six <a href="/links/link2">part seven</a>part eight</li>'

def get_data(d):
    # Text nodes are yielded as-is.
    if isinstance(d, bs4.element.NavigableString):
        yield d
    # For links, yield the href before descending into the link text.
    elif d.name == 'a':
        yield d['href']
    # Recurse into children in document order, skipping bare newlines.
    for b in getattr(d, 'contents', []):
        if b != '\n':
            yield from get_data(b)

print(list(get_data(bs4.BeautifulSoup(s, 'html.parser'))))
Output:
['part one', 'part two', 'part three ', '/links/link1', 'part four', 'part five', 'part six ', '/links/link2', 'part seven', 'part eight']
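The result above keeps the whitespace found in the source ('part three ', 'part six '), while the list in the question has it trimmed. A minimal sketch of a post-processing step, assuming that trimming each entry and dropping entries that become empty is acceptable; `clean_parts` is a hypothetical helper name, not part of BeautifulSoup:

```python
def clean_parts(parts):
    """Strip surrounding whitespace and drop entries that end up empty."""
    # str() makes this work for plain strings and for string subclasses alike.
    stripped = (str(p).strip() for p in parts)
    return [p for p in stripped if p]


raw = ['part one', 'part two', 'part three ', '/links/link1', 'part four',
       'part five', 'part six ', '/links/link2', 'part seven', 'part eight']
print(clean_parts(raw))
```

The same helper can be applied to the output of any of the approaches here, since it only looks at the final list.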
Upvotes: 0
Reputation: 168986
You could use the built-in HTML parser class to walk the string and keep track of the bits you need.
from html.parser import HTMLParser

class BuildingBlocksParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.bits = []

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            self.bits.append(value)

    def handle_data(self, data):
        self.bits.append(data)

parser = BuildingBlocksParser()
parser.feed(
    'part one<p>part two</p><p>part three <a href="/links/link1">part four</a>part five</p><li>part six <a href="/links/link2">part seven</a>part eight</li>'
)
print(parser.bits)
outputs
['part one', 'part two', 'part three ', '/links/link1', 'part four', 'part five', 'part six ', '/links/link2', 'part seven', 'part eight']
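Note that handle_starttag above appends every attribute value, so a tag such as <p class="intro"> would also add 'intro' to the list. A sketch of a variant that keeps only href values from <a> tags and trims whitespace from text, assuming that is the behavior wanted; `LinkTextParser` and `extract_parts` are hypothetical names introduced here for illustration:

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect text chunks and <a href> values in document order."""

    def __init__(self):
        super().__init__()
        self.bits = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag == 'a':
            for key, value in attrs:
                # value is None for attributes written without a value.
                if key == 'href' and value is not None:
                    self.bits.append(value)

    def handle_data(self, data):
        # Trim surrounding whitespace and skip whitespace-only chunks.
        text = data.strip()
        if text:
            self.bits.append(text)


def extract_parts(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return parser.bits


print(extract_parts(
    'part one<p class="intro">part two <a href="/links/link1">part three</a></p>'
))
```

Filtering on the tag and attribute name is a design choice: if other attributes (for example `src` on images) should also be captured, the condition in `handle_starttag` would need to be widened.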
Upvotes: 2