Jesvin Jose

Reputation: 23078

How to parse HTML to a string template in Python?

I want to parse HTML and turn it into a string template. In the example below, I sought out elements marked with x-inner, and they became template placeholders in the final string. The x-attrsite attribute also became a template placeholder (with a different command, of course).

Input:

<div class="x,y,z" x-attrsite>
  <div x-inner></div>
  <div>
    <div x-inner></div>
  </div>
</div>

Desired output:

<div class="x,y,z" {attrsite}>{inner}<div>{inner}</div></div>

I know there are HTMLParser and BeautifulSoup, but I am at a loss as to how to extract the strings before and after the x-* markers and how to escape those strings for templating.


Existing curly braces should be handled sanely (that is, escaped so they stay literal), as in this sample:

<div x-maybe-highlighted> The template string "there are {n} message{suffix}" can be used.</div>

Upvotes: 1

Views: 1244

Answers (1)

alecxe

Reputation: 473763

BeautifulSoup can handle the case:

  • find all div elements that have an x-attrsite attribute, remove that attribute, and add a {attrsite} attribute with the value None (which produces an attribute with no value)
  • find all div elements that have an x-inner attribute and use replace_with() to replace each element with the text {inner}

Implementation:

from bs4 import BeautifulSoup

data = """
<div class="x,y,z" x-attrsite>
  <div x-inner></div>
  <div>
    <div x-inner></div>
  </div>
</div>
"""

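soup = BeautifulSoup(data, 'html.parser')

# Swap the x-attrsite marker for a value-less {attrsite} attribute.
for div in soup.find_all('div', {'x-attrsite': True}):
    del div['x-attrsite']
    div['{attrsite}'] = None

# Replace every element marked with x-inner by the text {inner}.
for div in soup.find_all('div', {'x-inner': True}):
    div.replace_with('{inner}')

# prettify() re-indents the markup, so the output contains extra whitespace.
print(soup.prettify())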

Prints:

<div class="x,y,z" {attrsite}>
 {inner}
 <div>
  {inner}
 </div>
</div>
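
The output above still contains the whitespace that prettify() adds, and braces already present in the text are not escaped. As a rough sketch of one way to cover both points, here is a variant that uses the standard library's html.parser.HTMLParser instead of BeautifulSoup. The names TemplateBuilder, esc_braces and html_to_template are made up for this illustration. It assumes that whitespace-only text between tags can be dropped, that comments and doctypes are not needed in the template, and that str.format() is the templating mechanism, so literal braces are doubled.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def esc_braces(text):
    """Double literal braces so that str.format() leaves them alone."""
    return text.replace("{", "{{").replace("}", "}}")


class TemplateBuilder(HTMLParser):
    """Hypothetical helper: turns x-* markers into str.format() placeholders."""

    def __init__(self):
        super().__init__()   # default convert_charrefs=True: text arrives unescaped
        self.parts = []      # pieces of the output template
        self.skip_depth = 0  # > 0 while inside an element marked with x-inner

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Inside an x-inner element: emit nothing, only track nesting.
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        if any(name == "x-inner" for name, _ in attrs):
            # The whole element, children included, becomes one placeholder.
            self.parts.append("{inner}")
            if tag not in VOID_TAGS:
                self.skip_depth = 1
            return
        out = [tag]
        for name, value in attrs:
            if name == "x-attrsite":
                out.append("{attrsite}")
            elif value is None:
                out.append(esc_braces(name))  # attribute without a value
            else:
                out.append('%s="%s"' % (esc_braces(name),
                                        esc_braces(escape(value))))
        self.parts.append("<" + " ".join(out) + ">")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. the synthetic end tag produced for <br/>
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return  # drop x-inner content and whitespace-only text
        # Re-escape HTML special characters, then protect literal braces.
        self.parts.append(esc_braces(escape(data, quote=False)))


def html_to_template(html):
    builder = TemplateBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered trailing text
    return "".join(builder.parts)


sample = """
<div class="x,y,z" x-attrsite>
  <div x-inner></div>
  <div>
    <div x-inner></div>
  </div>
</div>
"""

template = html_to_template(sample)
print(template)
# <div class="x,y,z" {attrsite}>{inner}<div>{inner}</div></div>

print(template.format(attrsite='id="main"', inner="<b>hi</b>"))
# <div class="x,y,z" id="main"><b>hi</b><div><b>hi</b></div></div>
```

Text such as "there are {n} messages" comes out as "there are {{n}} messages" in the template, so calling format() on it gives back the original braces. Markers other than x-inner and x-attrsite are left in place as ordinary attributes, since their intended placeholder form is not specified here.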

Upvotes: 2
