Remove characters outside of the BMP (emojis) in Python 3

Question

I have an error: UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 266-266: Non-BMP character not supported in Tk

I'm parsing some data, and some emoji end up in it. For example, I have:

data = "this variable contains some emoji'sツ😂"

and I want:

data = "this variable contains some emoji's"

How can I remove these characters from my data, or otherwise handle this situation, in Python 3?

ShadowRanger · Accepted Answer

If the goal is just to remove all characters above '\uFFFF', the straightforward approach is to do just that:

data = "this variable contains some emoji'sツ😂"
data = ''.join(c for c in data if c <= '\uFFFF')
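
As a quick check of what that filter keeps, here is a small sketch (the helper name strip_non_bmp is only for illustration). Note that ツ is U+30C4, which is inside the BMP, so this filter leaves it alone; only 😂 (U+1F602) is dropped.

```python
def strip_non_bmp(text):
    """Drop every code point above U+FFFF (hypothetical helper name)."""
    return ''.join(c for c in text if c <= '\uFFFF')


data = "this variable contains some emoji'sツ😂"
cleaned = strip_non_bmp(data)

# ツ (U+30C4) is a BMP character and is kept; 😂 (U+1F602) is removed.
print(cleaned)
```

If BMP characters such as ツ also need to go, that is a different filter (for example, keeping only ASCII) and not what the comparison against '\uFFFF' does.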

It's possible your string is in decomposed form, in which case you may need to normalize it to composed form (NFC) first so that the non-BMP characters are identifiable:

import unicodedata

data = ''.join(c for c in unicodedata.normalize('NFC', data) if c <= '\uFFFF')
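
If dropping the characters outright loses too much, one variant of the same idea is to substitute a placeholder instead of removing them. This is only a sketch: the function name replace_non_bmp and the choice of U+FFFD (the Unicode replacement character) as the placeholder are my assumptions, not something the error message or Tk prescribes.

```python
import unicodedata


def replace_non_bmp(text, placeholder='\uFFFD'):
    """Swap each code point above U+FFFF for a placeholder.

    Hypothetical helper: the name and the default placeholder are
    illustrative choices, not part of any library.
    """
    normalized = unicodedata.normalize('NFC', text)
    return ''.join(c if c <= '\uFFFF' else placeholder for c in normalized)


data = "this variable contains some emoji'sツ😂"
safe = replace_non_bmp(data)

# Every character in the result is now within the BMP.
print(safe)
```

Passing placeholder='' gives the same result as the removal version above.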
