user7707957

Reputation: 41

Getting proper length of emojis

I noticed that when you type emojis into a phone message, some of them count as 1 character and others as 2. For example, "♊" takes 1 character but "😁" takes 2. In Python, I'm trying to get the length of an emoji and this is what I get:

len("♊") # 3
len("😁") # 4
len(unicode("♊", "utf-8")) # 1 OH IT WORKS!
len(unicode("😁", "utf-8")) # 1 Oh wait, no it doesn't.

Any ideas?

This site lists the length of each emoji in its Character.charCount() row: http://www.fileformat.info/info/unicode/char/1F601/index.htm

Upvotes: 4

Views: 2222

Answers (1)

JosefZ

Reputation: 30113

See the documentation for sys.maxunicode:

An integer giving the value of the largest Unicode code point, i.e. 1114111 (0x10FFFF in hexadecimal).

Changed in version 3.3: Before PEP 393, sys.maxunicode used to be either 0xFFFF or 0x10FFFF, depending on the configuration option that specified whether Unicode characters were stored as UCS-2 or UCS-4.

The following script should work in both Python 2 and Python 3:

# coding=utf-8

from __future__ import print_function
import sys, platform, unicodedata

# Report the interpreter version and the largest code point it can store.
print(platform.python_version(), 'maxunicode', hex(sys.maxunicode))
tab = '\t'
unistr = u'\u264a \U0001f601'                          ###   unistr = u'♊ 😁'
# Length and repr of the whole string.
print(len(unistr), tab, unistr, tab, repr(unistr))
# Per-element length, repr, general category, and name (with a fallback label).
for char in unistr:
    print(len(char), tab, char, tab, repr(char), tab,
        unicodedata.category(char), tab, unicodedata.name(char, 'private use'))

The output shows the consequence of the different sys.maxunicode values. For instance, the 😁 character (Unicode code point 0x1f601, above the Basic Multilingual Plane) is stored as the corresponding surrogate pair (code units u'\ud83d' and u'\ude01') if sys.maxunicode is 0xFFFF:

PS D:\PShell> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8

PS D:\PShell> . py -3 D:\test\Python\Py\42783173.py
3.5.1 maxunicode 0x10ffff
3      ♊ 😁   '♊ 😁'
1      ♊      '♊'      So      GEMINI
1             ' '      Zs      SPACE
1      😁     '😁'      So      GRINNING FACE WITH SMILING EYES

PS D:\PShell> . py -2 D:\test\Python\Py\42783173.py
2.7.12 maxunicode 0xffff
4      ♊ 😁   u'\u264a \U0001f601'
1      ♊      u'\u264a'    So      GEMINI
1             u' '         Zs      SPACE
1      ��     u'\ud83d'    Cs      private use
1      ��     u'\ude01'    Cs      private use

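Note: the output examples above were taken from the Unicode-aware PowerShell ISE console pane. The 'private use' label on the last two lines is only the fallback string the script passes to unicodedata.name(); the Cs category identifies those two elements as surrogates.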
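
To make the surrogate-pair conversion concrete, and to get a count that does not depend on sys.maxunicode, here is a minimal sketch. The helper names surrogate_pair, utf16_units and code_points are hypothetical, made up for illustration. I checked the logic against Python 3 only; it should also hold on a narrow Python 2 build, assuming its UTF-32 encoder joins surrogate pairs, but treat that as an assumption.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits of the offset
    return high, low


def utf16_units(text):
    """Count UTF-16 code units (2 bytes each in UTF-16-LE)."""
    return len(text.encode('utf-16-le')) // 2


def code_points(text):
    """Count code points (4 bytes each in UTF-32-LE)."""
    return len(text.encode('utf-32-le')) // 4


high, low = surrogate_pair(0x1F601)
print(hex(high), hex(low))
for ch in (u'\u264a', u'\U0001f601'):
    print(repr(ch), 'utf16_units:', utf16_units(ch), 'code_points:', code_points(ch))
```

For 0x1F601 the first function gives 0xd83d and 0xde01, the same pair that shows up in the Python 2.7 output above. utf16_units returns 1 for ♊ and 2 for 😁, which matches the Character.charCount() figures mentioned in the question, while code_points returns 1 for both.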

Upvotes: 1
