Reputation: 779
I'm preprocessing some text for a NER model I'm training, and I keep encountering the zero-width space character (\u200b). It is not removed by strip():
>>> 'Hello world!\u200b'.strip()
'Hello world!\u200b'
It is not considered a whitespace for regular expressions:
>>> re.sub(r'\s+', ' ', "hello\u200bworld!")
'hello\u200bworld!'
and spaCy's tokenizer does not split tokens upon it:
>>> [t.text for t in nlp("hello\u200bworld!")]
['hello\u200bworld', '!']
So, how should I handle it? I could simply replace it, but I don't want to special-case this one character. I would rather replace all characters with similar characteristics.
Thanks.
Upvotes: 11
Views: 8577
Reputation: 10011
How about simply doing a string replace before running the NLP pipeline?
'Hello world!\u200b'.replace('\u200b', ' ').strip()
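If more than one character needs this treatment, the same idea can be sketched with a translation table. The function name `replace_chars` and its parameters are made up for illustration, and which characters belong in `chars` is an assumption you would have to decide for your own data:

```python
def replace_chars(text, chars="\u200b", replacement=" "):
    """Replace every character listed in `chars`, then strip the ends.

    `chars` defaults to the zero-width space only; extending it
    (e.g. with "\u200c") is a choice that depends on your corpus.
    """
    # str.maketrans accepts a dict mapping single characters to strings.
    table = str.maketrans({c: replacement for c in chars})
    return text.translate(table).strip()


print(repr(replace_chars("Hello world!\u200b")))
print(repr(replace_chars("hello\u200bworld!")))
```

Replacing with a space (rather than an empty string) makes the tokenizer split at that position; pass `replacement=""` if the words should be joined instead.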
Upvotes: 3
Reputation: 76
As you mentioned, characters like \u200b (zero-width space) and \u200c (zero-width non-joiner) are not considered whitespace characters, so you cannot remove them with techniques meant for whitespace.
The only way, as you may have noticed, is to treat such characters as a special case.
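One way to special-case these characters without listing each code point is to key on the Unicode general category: both \u200b and \u200c are in category "Cf" (format). The following is only a sketch using the standard-library `unicodedata` module. The function name `replace_format_chars` is hypothetical, and treating the whole "Cf" category as "similar characters" is an assumption that may be too broad for some text:

```python
import unicodedata


def replace_format_chars(text, replacement=" "):
    """Replace every character whose Unicode general category is Cf."""
    return "".join(
        replacement if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


print(unicodedata.category("\u200b"))               # category of zero-width space
print(repr(replace_format_chars("hello\u200bworld!")))
```

Note that "Cf" also covers the soft hyphen (\u00ad), the zero-width joiner (\u200d, used inside emoji sequences), and bidirectional marks, and the zero-width non-joiner is meaningful in scripts such as Persian, so it is worth checking what this does to your corpus before applying it blindly.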
Upvotes: 5