Reputation:
We are converting DOCX to HTML through some external converter tool. The generated HTML for tables contains something like this:
<td><div><span><b>Patienten</b></span></div></td>
The <div> and <span> tags inside the <td> are completely superfluous here.
The expected result is
<td><b>Patienten</b></td>
Is there a sane way to remove them using BeautifulSoup?
Upvotes: 2
Views: 1857
Reputation: 4940
The way we do it is to use lxml to determine the parents and children of every element. If a parent and its children have the same text content, we follow a set of rules to retain certain children while discarding the parents, and then force the appropriate block elements. In your case, b is a child of span, div and td. We know that td is the structuring element that matters, so we get rid of the others. Again, this requires testing the text content of each of the nested elements.
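To make the text-content test concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, and it only works on well-formed markup. The WRAPPERS set and the collapse name are assumptions made for illustration, not the rules we actually run in production.

```python
import xml.etree.ElementTree as ET

# Assumption: only these tags are treated as disposable wrappers.
WRAPPERS = {"div", "span"}


def text_of(elem):
    """All text inside elem, excluding elem's own tail."""
    return "".join(elem.itertext())


def collapse(parent):
    """Replace attribute-less wrappers by their single child when the
    wrapper adds no text of its own, then recurse into the result."""
    for index, child in enumerate(list(parent)):
        while (
            child.tag in WRAPPERS
            and not child.attrib
            and len(child) == 1
            and text_of(child) == text_of(child[0])
        ):
            inner = child[0]
            inner.tail = child.tail  # keep any text that followed the wrapper
            parent[index] = inner    # put the child in the wrapper's place
            child = inner
        collapse(child)


root = ET.fromstring("<td><div><span><b>Patienten</b></span></div></td>")
collapse(root)
print(ET.tostring(root, encoding="unicode"))
```

Wrappers that carry attributes, or that contain text beside their child, are left alone, since dropping them would change the content or styling.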
Upvotes: 1
Reputation: 27585
If Beautiful Soup alone isn't sufficient, you can resort to regular expressions.
import re
ch = 'sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week'
# wanted: <td><b>Patienten</b></td>
RE = r'(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)'
pat = re.compile(RE)
print(ch)
# keep groups 1-3 (the <td>, the <b>...</b> and the </td>), drop the wrappers
print(pat.sub(r'\1\2\3', ch))
Result:
sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week
sunny day<td><b>Patienten</b></td>rainy week
Easy, isn't it?
A preliminary inspection can be done to determine if the replacement must really be done or not.
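One way to do such a preliminary check, as a rough sketch: search first, and only substitute when the pattern is actually present. The function name strip_wrappers is made up for this example; re.search and re.subn are standard library calls, and subn also reports how many replacements were made.

```python
import re

PATTERN = re.compile(r"(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)")


def strip_wrappers(html):
    """Return (new_html, number_of_replacements)."""
    # Preliminary inspection: leave the input untouched if nothing matches.
    if PATTERN.search(html) is None:
        return html, 0
    # subn returns the new string together with the replacement count.
    return PATTERN.subn(r"\1\2\3", html)


print(strip_wrappers("sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week"))
print(strip_wrappers("<td><b>already clean</b></td>"))
```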
Upvotes: 0
Reputation: 71979
I like the approach suggested by @Daren Thomas, but be aware that removing those "useless" tags could drastically affect the rendered appearance of the document thanks to JavaScript (less likely) or CSS (much more likely, possibly even probable) that relies on the resulting HTML to follow certain structural patterns, even if they are wasteful.
Always emitting the wrappers makes the life of the tool writer much easier. Assume that some given construct in the DOCX has two possible variations. One of these requires a lot of boilerplate so you can attach a few special attributes (say a text-align or some such). The other doesn't. It's way easier to just always generate the boilerplate and write your CSS or what-have-you with that fact in mind.
Upvotes: 0
Reputation: 6142
You could rearrange the parse tree like this:
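from bs4 import BeautifulSoup  # BeautifulSoup 4 (the old BeautifulSoup 3 import is Python 2 only)
soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>", "html.parser")
td = soup.td
b = soup.td.div.span.b
td.insert(0, b)    # move <b> so it becomes the first child of <td>
td.div.extract()   # drop the now-empty <div><span></span></div>
print(soup)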
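The rearrangement above is hard-coded for a single cell. A more general sketch, assuming BeautifulSoup 4 (the bs4 package) and its Tag.unwrap() method, could walk every td and unwrap each div or span that carries no attributes. The function name unwrap_cell_wrappers is invented for this example, and skipping wrappers with attributes is my own precaution so that class- or style-based CSS keeps working.

```python
from bs4 import BeautifulSoup


def unwrap_cell_wrappers(html):
    """Remove attribute-less <div>/<span> wrappers inside table cells."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td"):
        for wrapper in cell.find_all(["div", "span"]):
            # Assumption: a wrapper with attributes may matter for styling.
            if not wrapper.attrs:
                wrapper.unwrap()  # replace the tag with its own children
    return str(soup)


print(unwrap_cell_wrappers(
    "<table><tr><td><div><span><b>Patienten</b></span></div></td></tr></table>"
))
```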
Upvotes: 0
Reputation: 70344
Well, the <div> and <span> tags have a structural meaning that cannot be automatically guessed to be "superfluous".
Your problem looks very similar to AST (Abstract Syntax Tree) optimization done in compilers. You could try to define some rules and build a SoupOptimizer to take a tree (your document) and produce an optimized output tree. Rules could be:
Note that tree transformations on XML dialects can be done with XSLT. Just be ready to have your brain turned inside out before you see the light!
Upvotes: 1
Reputation: 10353
You could use the strip_tags function from Jesse Dhillon's answer to this question.
Upvotes: 0