Amyth

Reputation: 32959

Stripping space between html tags

I have a string that contains some html tags as follows:

"<p>   This is a   test   </p>"

I want to strip all the extra spaces between the tags. I have tried the following:

In [1]: import re

In [2]: val = "<p>   This is a   test   </p>"

In [3]: re.sub("\s{2,}", "", val)
Out[3]: '<p>This is atest</p>'

In [4]: re.sub("\s\s+", "", val)
Out[4]: '<p>This is atest</p>'

In [5]: re.sub("\s+", "", val)
Out[5]: '<p>Thisisatest</p>'

but I am not able to get the desired result, i.e. <p>This is a test</p>

How can I achieve this?

Upvotes: 1

Views: 649

Answers (6)

UltraInstinct

Reputation: 44444

From the question, I see that you are parsing a very specific HTML string. Although a regular expression is quick and dirty, it's not recommended; use an XML parser instead. Note: XML is stricter than HTML, so if your input might not be well-formed XML, use BeautifulSoup as @Haidro suggests.

For your case, you'd do something like this:

>>> import xml.etree.ElementTree as ET
>>> p = ET.fromstring("<p>   This is a   test   </p>")
>>> p.text.strip()
'This is a   test'
>>> p.text = p.text.strip()  # If you want to perform more operations on the string, do them here.
>>> ET.tostring(p)
'<p>This is a   test</p>'
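
Note that strip() only trims the ends, so the run of spaces inside the text survives. A minimal sketch of the "more operations" step, assuming the input is well-formed XML and that only the element's own text (not nested children) needs cleaning; the helper name normalize_text is made up for illustration:

```python
import xml.etree.ElementTree as ET


def normalize_text(markup):
    """Collapse whitespace in the root element's text (hypothetical helper)."""
    elem = ET.fromstring(markup)
    if elem.text:
        # split() with no argument drops leading/trailing whitespace and
        # splits on any run of whitespace; join() puts single spaces back.
        elem.text = " ".join(elem.text.split())
    # encoding="unicode" makes tostring() return str instead of bytes.
    return ET.tostring(elem, encoding="unicode")


print(normalize_text("<p>   This is a   test   </p>"))
```

This should print <p>This is a test</p>. Text that follows nested child elements lives in each child's .tail and is not touched by this sketch.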

Upvotes: 1

ndpu

Reputation: 22571

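>>> import re
>>> s = '<p>   This is a   test   </p>'
>>> s = re.sub(r'(\s)(\s*)', r'\g<1>', s)  # collapse each whitespace run to its first character
>>> s
'<p> This is a test </p>'
>>> s = re.sub(r'>\s*', '>', s)  # drop whitespace after a tag
>>> s = re.sub(r'\s*<', '<', s)  # drop whitespace before a tag
>>> s
'<p>This is a test</p>'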

Upvotes: 0

Casimir et Hippolyte

Reputation: 89584

You can try this:

# Unmatched groups expand to an empty string (Python 3.5+).
re.sub(r'\s+(</)|(<[^/][^>]*>)\s+', r'\1\2', val)

Upvotes: 0

flyer

Reputation: 9816

This may help:

import re

val = "<p>   This is a   test   </p>"
re_strip_p = re.compile("<p>|</p>")

val = '<p>%s</p>' % re_strip_p.sub('', val).strip()

Upvotes: 0

TerryA

Reputation: 60004

Try using an HTML parser like BeautifulSoup:

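from bs4 import BeautifulSoup as BS
s = "<p>   This is a   test   </p>"
soup = BS(s, 'html.parser')  # name the parser explicitly
p = soup.find('p')
p.string = ' '.join(p.text.split())  # split() drops whitespace runs; join() restores single spaces
print(soup)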

Returns:

<p>This is a test</p>

Upvotes: 4

tripleee

Reputation: 189678

Try

val = re.sub(r'\s+<', '<', val)  # re.sub returns a new string, so keep the result
val = re.sub(r'>\s+', '>', val)

However, this is too simplistic for general real-world use, where angle brackets are not necessarily always part of a tag. (Think <code> blocks, <script> blocks, etc.) You should be using a proper HTML parser for anything like that.
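
To illustrate the caveat, here is a rough sketch that chains the substitutions above with an extra pass that collapses inner whitespace runs. The helper name squeeze and the <script> sample are made up for the demonstration:

```python
import re


def squeeze(html):
    """Regex-only whitespace cleanup (hypothetical helper, not robust)."""
    html = re.sub(r"\s+", " ", html)   # collapse every whitespace run
    html = re.sub(r">\s+", ">", html)  # trim after any '>'
    return re.sub(r"\s+<", "<", html)  # trim before any '<'


# Works for the simple string from the question.
print(squeeze("<p>   This is a   test   </p>"))

# The '<' here is a comparison operator, not a tag, yet it is edited too.
print(squeeze("<script>if (a < b) { go(); }</script>"))
```

The first call gives <p>This is a test</p>, but the second turns "a < b" into "a< b": the regex has no way to tell a tag from any other angle bracket, which is exactly why a parser is the safer choice.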

Upvotes: 2
