Removing Narrow 'No-Break Space' Unicode Characters (U+00A0) in python nlp

Non-breaking spaces are printed as whitespace, but handled internally as \xa0. How do I remove all these characters at once?

So far I've replaced it directly:

text = text.replace('\u202f','')  
text = text.replace('\u200d','') 
text = text.replace('\xa0','')

But each time I scrape the text sentences from external source, These characters are different. How do I remove it all at once?

Upvotes: 3

Answers (2)

Pickle

Reputation: 94

You can use \h (horizontal whitespace) to match non-breaking spaces (\s will also match vertical whitespace; newlines and the like)

\h Equivalent to [\t\x{00A0}\x{1680}\x{180E}\x{2000}\x{2001}\x{2002}\x{2003}\x{2004}\x{2005}\x{2006}\x{2007}\x{2008}\x{2009}\x{200A}\x{202F}\x{205F}\x{3000}]

https://regex101.com/ is a great resource for regex. you can search for "whitespace" in the Quick Reference (bottom right) to see other related options that may be helpful to you. (Remember to check full search results)

Upvotes: 0

MrBean Bremen

Reputation: 16825

You can use regular expression substitution instead.
If you want to replace all whitespace, you can just use:

import re

text = re.sub(r'\s', '', text)

This includes all unicode whitespace, as described in the answer to this question.
From that answer, you can see that (at the time of writing), the unicode constants recognized as whitespace (e.g. \s) in Python regular expressions are these:

0x0009
0x000A
0x000B
0x000C
0x000D
0x001C
0x001D
0x001E
0x001F
0x0020
0x0085
0x00A0
0x1680
0x2000
0x2001
0x2002
0x2003
0x2004
0x2005
0x2006
0x2007
0x2008
0x2009
0x200A
0x2028
0x2029
0x202F
0x205F
0x3000

This looks as if this will fit your needs.

Upvotes: 2

Removing Narrow &#39;No-Break Space&#39; Unicode Characters (U+00A0) in python nlp

Answers (2)

Related Questions

Removing Narrow 'No-Break Space' Unicode Characters (U+00A0) in python nlp