Reputation: 2552
I am using this function to check if a string contains multiple white spaces:
def check_multiple_white_spaces(text):
    return "  " in text
It usually works fine, but not in the following code:
from bs4 import BeautifulSoup
from string import punctuation
text = "<p>Hello &nbsp; &nbsp; &nbsp;world!!</p>\r\n\r"
text = BeautifulSoup(text, 'html.parser').text
text = ''.join(ch for ch in text if ch not in set(punctuation))
text = text.lower().replace('\n', ' ').replace('\t', '').replace('\r', '')
print check_multiple_white_spaces(text)
The final value of the text variable is hello world, but I don't know why the check_multiple_white_spaces function is returning False instead of True.
How can I fix this?
Upvotes: 1
Views: 6584
Reputation: 10223
There are no two consecutive spaces in the text variable, which is why the check_multiple_white_spaces function returns False.
>>> text
u'hello \xa0 \xa0 \xa0world '
>>> print text
hello world
\xa0 is the no-break space (NBSP), also called a non-breaking space or hard space.
The code point of a regular space is 32 and the code point of a no-break space is 160:
(u' ', 32)
(u'\xa0', 160)
The character \xa0 is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.
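To see which whitespace characters a string really contains, something like the following sketch may help. It is written for Python 3 (the session above is Python 2), and describe_chars is a hypothetical helper name, not something from the question.

```python
import unicodedata

def describe_chars(text):
    # Hypothetical helper: list every whitespace character in text
    # together with its code point and its Unicode name.
    return [(ch, ord(ch), unicodedata.name(ch, "UNKNOWN"))
            for ch in text if ch.isspace()]

sample = "hello \xa0 \xa0 \xa0world "
for ch, code, name in describe_chars(sample):
    # repr() makes the otherwise invisible difference visible.
    print(repr(ch), code, name)
```

Regular spaces show up as SPACE (32) and the hard spaces as NO-BREAK SPACE (160), even though both look identical when printed.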
Use the unidecode module to convert all non-ASCII characters to their closest ASCII equivalent.
Demo:
>>> import unidecode
>>> unidecode.unidecode(text)
'hello      world '
>>> "  " in unidecode.unidecode(text)
True
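If installing unidecode is not an option, a standard-library alternative might be Unicode compatibility normalization. This is a different technique from the one used in the demo above, shown only as a sketch for Python 3. NFKC also rewrites other compatibility characters (ligatures, full-width forms and so on), so it may change more than just the spaces.

```python
import unicodedata

text = "hello \xa0 \xa0 \xa0world "

# NFKC applies compatibility mappings; one of them maps
# NO-BREAK SPACE (U+00A0) to a regular SPACE (U+0020).
normalized = unicodedata.normalize("NFKC", text)

print(repr(normalized))
print("  " in normalized)  # True: the hard spaces are now regular spaces
```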
Upvotes: 0
Reputation: 46779
If you print the contents of text using repr(), you will see that it does not contain two consecutive spaces:
'hello \xa0 \xa0 \xa0world '
As a result, your function correctly returns False. This could be fixed by converting the no-break space into a regular space:
text = text.replace(u'\xa0', u' ')
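Put together with the function from the question, the fix could look like this minimal sketch. The text value is copied from the repr() output above, and the name normalized is only illustrative.

```python
def check_multiple_white_spaces(text):
    # True when two regular (ASCII) spaces appear next to each other.
    return "  " in text

text = u"hello \xa0 \xa0 \xa0world "

# Before: regular spaces and no-break spaces alternate, so no two
# regular spaces are adjacent.
print(check_multiple_white_spaces(text))        # False

# After: every no-break space becomes a regular space.
normalized = text.replace(u"\xa0", u" ")
print(check_multiple_white_spaces(normalized))  # True
```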
Upvotes: 3
Reputation: 20224
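First, your function check_multiple_white_spaces only looks for two consecutive regular spaces, so it misses runs that mix in other whitespace characters such as tabs or no-break spaces.
You could use re.search(r"\s{2,}", text) instead. On Python 2, pass the re.UNICODE flag so that \s also matches \xa0.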
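A small sketch of the regex approach, assuming the goal is to flag any run of two or more whitespace characters of any kind. The function name has_whitespace_run is made up for this example.

```python
import re

def has_whitespace_run(text):
    # \s{2,} matches two or more whitespace characters in a row.
    # re.UNICODE is already the default for str patterns on Python 3;
    # on Python 2 it is what makes \s match u'\xa0' as well.
    return re.search(r"\s{2,}", text, re.UNICODE) is not None

print(has_whitespace_run(u"hello \xa0 \xa0 \xa0world "))  # True
print(has_whitespace_run(u"hello world"))                 # False
```

Unlike the substring test, this also reports runs such as a tab followed by a space, which may or may not be what you want.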
Second, if you print text, you will find that you need to unescape it. See this answer: How do I unescape HTML entities in a string in Python 3.1?
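For reference, on Python 3.4 and later the standard library can do the unescaping with html.unescape. The sketch below assumes the input contains &nbsp; entities, which is a guess based on the \xa0 characters seen in the output. BeautifulSoup already decodes entities when you read .text, so this is mainly relevant if you handle the raw string yourself.

```python
import html

raw = "Hello &nbsp; &nbsp; &nbsp;world!!"

# html.unescape turns named and numeric entities into characters;
# &nbsp; becomes the no-break space U+00A0, not a regular space.
decoded = html.unescape(raw)
print(repr(decoded))

# The no-break spaces still have to be replaced explicitly.
cleaned = decoded.replace("\xa0", " ")
print("  " in cleaned)  # True
```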
Upvotes: 1