pygabriel

Reputation: 10008

Removing spaces and newlines between tags in HTML (aka unformatting) in Python

An example:

<p> Hello</p>
<div>hgello</div>
<pre>
   code
    code
</pre>

turns into something like:

<p> Hello</p><div>hgello</div><pre>
    code
     code
</pre>

How can I do this in Python? I also make heavy use of <pre> tags, so replacing every '\n' with '' is not an option.

What's the best way to do that?

Upvotes: 2

Views: 5907

Answers (2)

Kyra

Reputation: 5407

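I would use a regular expression through Python's re module (str.replace only does literal substitution, so it cannot take a pattern):

re.sub(r">\s+<", "><", string)

Here '\s' matches any whitespace character and the '+' after it means one or more of them. Because the pattern only matches when nothing but whitespace sits between '>' and '<', there is no risk of it replacing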

<pre>
    code
     code
</pre>

with

<pre></pre>

More information about regular expressions can be found here, here and here.
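
As a rough sketch of the substitution described above, here it is run against the sample from the question. The variable names html and compact are only illustrative, and the sample is assumed to end with a proper </pre> closing tag.

```python
import re

# Sample input from the question, assuming the last tag is a closing </pre>.
html = "<p> Hello</p>\n<div>hgello</div>\n<pre>\n   code\n    code\n</pre>\n"

# Drop any run of whitespace that sits directly between '>' and '<'.
# The text inside <pre> is untouched because it contains non-whitespace characters.
compact = re.sub(r">\s+<", "><", html)

print(compact)
```

Note that this only protects <pre> content that is plain text. If a <pre> block contains tags of its own, the whitespace between those tags is removed as well.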

Upvotes: 2

phimuemue

Reputation: 36031

You could use re.sub(r">\s*<", "><", html), where html is your HTML string.

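Or maybe html = html.replace(">\n", ">"), i.e. look for a closing bracket followed by a newline and remove the newline. Note that this also strips the newline right after an opening <pre> tag, and it leaves spaces and tabs between tags in place.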
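
Neither one-liner knows about <pre>, so tags nested inside a <pre> block would still be squeezed together. One possible way around that, sketched here under the assumption that <pre> blocks are properly closed and not nested, is to match whole <pre>...</pre> blocks first and hand them back unchanged. The names TAG_GAP_OR_PRE and unformat_html are made up for this example, and a regex is not a real HTML parser, so treat it as a starting point only.

```python
import re

# Alternative 1 captures a whole <pre>...</pre> block (group 1).
# Alternative 2 matches whitespace that has '>' right before it and '<' right after it.
TAG_GAP_OR_PRE = re.compile(
    r"(<pre\b.*?</pre>)|(?<=>)\s+(?=<)",
    re.DOTALL | re.IGNORECASE,
)

def unformat_html(html):
    """Remove whitespace between tags, leaving <pre> blocks exactly as they are."""
    # If group 1 matched, put the <pre> block back verbatim; otherwise delete the gap.
    return TAG_GAP_OR_PRE.sub(lambda m: m.group(1) or "", html)

# Hypothetical sample with tags inside the <pre> block.
sample = (
    "<p> Hello</p>\n<div>hgello</div>\n"
    "<pre>\n  <b>code</b>\n  <i>code</i>\n</pre>\n<p>bye</p>"
)
print(unformat_html(sample))
```

The lookbehind and lookahead keep the brackets themselves out of the match, so a <pre> tag that directly follows a removed gap can still be recognised as the start of a protected block.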

Upvotes: 6
