Reputation: 930
This is the input sample text. I want to do in object based cleanup to avoid hierarchy issues
<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>
Required Output
<p><b><i>sample text</i></b></p>
Upvotes: 2
Views: 180
Reputation: 930
I written this Object based cleanup using lxml for sublevel duplicate tags. It may help others.
import lxml.etree as ET
textcont = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'
soup = ET.fromstring(textcont)
for tname in ['i','b']:
for tagn in soup.iter(tname):
if tagn.getparent().getparent() != None and tagn.getparent().getparent().tag == tname:
iparOfParent = tagn.getparent().getparent()
iParent = tagn.getparent()
if iparOfParent.text == None:
iparOfParent.addnext(iParent)
iparOfParent.getparent().remove(iparOfParent)
elif tagn.getparent() != None and tagn.getparent().tag == tname:
iParent = tagn.getparent()
if iParent.text == None:
iParent.addnext(tagn)
iParent.getparent().remove(iParent)
print(ET.tostring(soup))
output:
b'<p><b><i>sample text</i></b></p>'
Upvotes: 2
Reputation: 1683
You can use the regex library re
to create a function to search for the matching opening tag and closing tag pair and everything else in between. Storing tags in a dictionary will remove duplicate tags and maintain the order they were found in (if order isn't important then just use a set). Once all pairs of tags are found, wrap what's left with the keys of the dictionary in reverse order.
import re
def remove_duplicates(string):
tags = {}
while (match := re.findall(r'\<(.+)\>([\w\W]*)\<\/\1\>', string)):
tag, string = match[0][0], match[0][1] # match is [(group0, group1)]
tags.update({tag: None})
for tag in reversed(tags):
string = f'<{tag}>{string}</{tag}>'
return string
Note: I've used [\w\W]*
as a cheat to match everything.
Upvotes: 0
Reputation: 15
Markdown, itself, provides structural to extract elements inside
Using re
in python, you may extract elements and recombine them.
For example:
import re
html = """<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>"""
regex_object = re.compile("\<(.*?)\>")
html_objects = regex_object.findall(html)
set_html = []
for obj in html_objects:
if obj[0] != "/" and obj not in set_html:
set_html.append(obj)
regex_text = re.compile("\>(.*?)\<")
text = [result for result in regex_text.findall(html) if result][0]
# Recombine
result = ""
for obj in set_html:
result += f"<{obj}>"
result += text
for obj in set_html[::-1]:
result += f"</{obj}>"
# result = '<p><b><i>sample text</i></b></p>'
Upvotes: 0