user2665694

Removing unnecessary inner tags

We are converting DOCX to HTML through some external converter tool. The generated HTML for tables contains something like this:

<td><div><span><b>Patienten</b></span></div></td>

The <div> and <span> tags inside the <td> are completely superfluous here.

The expected result is:

<td><b>Patienten</b></td>

Is there a sane way to remove them using BeautifulSoup?

Upvotes: 2

Views: 1857

Answers (6)

PyNEwbie

Reputation: 4940

We do this with lxml by determining the parents and children of every element. If a parent and its child have the same text content, we apply a set of rules that keep certain children and discard the parents, and then force the appropriate block elements. In your case, b is nested inside span, div and td; we know that td is the relevant structuring element, so we get rid of the others. Again, this requires testing the text content of each of the nested elements.
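
To make the text-content test concrete, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so it only handles well-formed markup. The KEEP set, the helper names and the extra "no attributes" condition are assumptions made for illustration, not the rule set described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set: tags that are never dropped.
KEEP = {"td", "b"}

def text_of(el):
    """All text inside el, including the text of its descendants."""
    return "".join(el.itertext())

def collapse(parent):
    """Replace wrappers under parent by their only child when the text is identical."""
    for i, child in enumerate(list(parent)):
        # A wrapper is dropped only if it has exactly one child, no attributes
        # (an added assumption) and the same text content as that child.
        while (child.tag not in KEEP and len(child) == 1
               and not child.attrib
               and text_of(child) == text_of(child[0])):
            inner = child[0]
            inner.tail = child.tail   # keep any text that followed the wrapper
            parent[i] = inner         # put the child in the wrapper's place
            child = inner
        collapse(child)

root = ET.fromstring("<td><div><span><b>Patienten</b></span></div></td>")
collapse(root)
print(ET.tostring(root, encoding="unicode"))  # <td><b>Patienten</b></td>
```

A wrapper that carries text of its own, as in <div>x<b>y</b></div>, fails the comparison and is left alone.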

Upvotes: 1

eyquem

Reputation: 27585

If Beautiful Soup alone isn't sufficient, you can resort to a regular expression.

import re

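ch = 'sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week'
# wanted cell: <td><b>Patienten</b></td>

# Capture the <td> tags and the <b> element; the <div><span> wrappers
# between them are matched but not captured, so they are dropped.
RE = r'(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)'

pat = re.compile(RE)

print(ch)
print(pat.sub(r'\1\2\3', ch))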

result

sunny day<td><div><span><b>Patienten</b></span></div></td>rainy week
sunny day<td><b>Patienten</b></td>rainy week

Easy, isn't it?

You can inspect the text first to determine whether the replacement is really needed.
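
One possible reading of that inspection step, as a sketch only: search for the pattern first and rewrite the text only when it occurs. The function name strip_wrappers and its (text, changed) return value are made up for this example.

```python
import re

# Same pattern as above: a <b> element wrapped in <div><span> inside a <td>.
pat = re.compile(r'(<td>)<div><span>(<b>.*?</b>)</span></div>(</td>)')

def strip_wrappers(html):
    """Return (new_html, changed); the text is rewritten only if the pattern occurs."""
    if pat.search(html) is None:
        return html, False
    return pat.sub(r'\1\2\3', html), True

print(strip_wrappers('<td><div><span><b>Patienten</b></span></div></td>'))
print(strip_wrappers('<td><b>Patienten</b></td>'))
```

The flag lets the caller skip writing back documents that were not touched.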

Upvotes: 0

Hank Gay

Reputation: 71979

I like the approach suggested by @Daren Thomas, but be aware that removing those "useless" tags could drastically affect the rendered appearance of the document. JavaScript (less likely) or CSS (much more likely, possibly even probable) may rely on the generated HTML following certain structural patterns, even if they are wasteful.

Always emitting the same structure makes the life of the tool writer much easier. Assume that some given construct in the DOCX has two possible variations. One of these requires a lot of boilerplate so you can attach a few special attributes (say a text-align or some such). The other doesn't. It's way easier to just always generate the boilerplate and write your CSS or what-have-you with that fact in mind.

Upvotes: 0

das_weezul

Reputation: 6142

You could rearrange the parse tree like this:

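from bs4 import BeautifulSoup  # BeautifulSoup 4 package name

soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>", "html.parser")
td = soup.td
b = soup.td.div.span.b
td.insert(0, b)    # move <b> so it becomes the first child of <td>
td.div.extract()   # remove the now-empty <div><span> wrapper
print(soup)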

Upvotes: 0

Daren Thomas

Reputation: 70344

Well, the <div> and <span> tags have a structural meaning, so they cannot automatically be assumed to be "superfluous".

Your problem looks very similar to AST (Abstract Syntax Tree) optimization done in compilers. You could try to define some rules and build a SoupOptimizer to take a tree (your document) and produce an optimized output tree. Rules could be:

  • span(content) -> content, if span.attributes is empty
  • div(content) -> content, if div.attributes is empty
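
The two rules above could be sketched with BeautifulSoup 4 roughly as follows. This is not a full SoupOptimizer: the function name unwrap_bare and the choice of html.parser are assumptions made for this example.

```python
from bs4 import BeautifulSoup

def unwrap_bare(soup, names=("div", "span")):
    """Replace every attribute-free tag listed in names by its contents."""
    # find_all returns a complete list up front, so the tree can be
    # modified safely while looping over it.
    for tag in soup.find_all(list(names)):
        if not tag.attrs:
            tag.unwrap()
    return soup

soup = BeautifulSoup("<td><div><span><b>Patienten</b></span></div></td>", "html.parser")
print(unwrap_bare(soup))  # <td><b>Patienten</b></td>
```

A <div> or <span> that carries a class, id or style attribute is left in place, which limits the damage to CSS that targets those attributes. It does not protect selectors that depend on the bare nesting itself.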

Note that tree transformations on XML dialects can be done with XSLT. Just be ready to have your brain turned inside out before you see the light!

Upvotes: 1

Fábio Diniz

Reputation: 10353

You could use the strip_tags function from Jesse Dhillon's answer to this question.

Upvotes: 0
