Reputation: 63
I have an error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk
I'm parsing some data, and a few emoji end up in it. data = "this variable contains some emoji'sツ😂"
I want: data = "this variable contains some emoji's"
How can I remove these characters from my data or handle this situation in Python 3?
Upvotes: 4
Views: 4183
Reputation: 1529
>>> import string
>>> printable = set(string.printable)
>>> ''.join(filter(lambda x: x in printable, data))  # in Python 3, filter() returns an iterator, so join it back into a str
"this variable contains some emoji's"
For BMP read this: removing emojis from a string in Python
Upvotes: -2
Reputation: 155684
If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do just that:
data = "this variable contains some emoji'sツ😂"
data = ''.join(c for c in data if c <= '\uFFFF')
It's possible your string is in decomposed form, so you may need to normalize
it to composed form first so the non-BMP characters are identifiable:
import unicodedata
data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
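If you need this in more than one place, the two snippets above could be wrapped in a small helper. This is only a sketch: the name `strip_non_bmp` and the `replacement` parameter are my own, not part of any library. Passing a replacement such as U+FFFD keeps a visible placeholder instead of silently dropping the character.

```python
import unicodedata


def strip_non_bmp(text, replacement=""):
    """Hypothetical helper: replace every code point above U+FFFF.

    The text is normalized to NFC first, as in the snippet above.
    """
    normalized = unicodedata.normalize("NFC", text)
    # Single-character strings compare by code point, so this keeps the BMP.
    return "".join(c if c <= "\uFFFF" else replacement for c in normalized)


data = "this variable contains some emoji'sツ😂"
print(strip_non_bmp(data))            # non-BMP characters are dropped
print(strip_non_bmp(data, "\uFFFD"))  # or swapped for the replacement character
```

Note that ツ (U+30C4) is inside the BMP, so it survives; only 😂 is removed. If you want to drop ツ as well, you need a stricter filter than "above '\uFFFF'".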
Upvotes: 11