Reputation: 41
I noticed that while you are inputting emojis in your phone message some of them take 1 character and some of them are taking 2. For example, "♊" takes 1 char but "😁" takes 2. In python, I'm trying to get length of emojis and I'm getting:
len("♊") # 3
len("😁") # 4
len(unicode("♊", "utf-8")) # 1 OH IT WORKS!
len(unicode("😁", "utf-8")) # 1 Oh wait, no it doesn't.
Any ideas?
This site has emojis length in Character.charCount()
row: http://www.fileformat.info/info/unicode/char/1F601/index.htm
Upvotes: 4
Views: 2222
Reputation: 30113
Read sys.maxunicode:
An integer giving the value of the largest Unicode code point, i.e.
1114111
(0x10FFFF
in hexadecimal).Changed in version 3.3: Before PEP 393,
sys.maxunicode
used to be either0xFFFF
or0x10FFFF
, depending on the configuration option that specified whether Unicode characters were stored asUCS-2
orUCS-4
.
The following script should work in both Python versions 2 an 3:
# coding=utf-8
from __future__ import print_function
import sys, platform, unicodedata
print( platform.python_version(), 'maxunicode', hex(sys.maxunicode))
tab = '\t'
unistr = u'\u264a \U0001f601' ### unistr = u'♊ 😁'
print ( len(unistr), tab, unistr, tab, repr( unistr))
for char in unistr:
print (len(char), tab, char, tab, repr(char), tab,
unicodedata.category(char), tab, unicodedata.name(char,'private use'))
Output shows consequence of different sys.maxunicode
property value. For instance, the 😁
character (unicode codepoint 0x1f601
above the Basic Multilingual Plane) is converted to corresponding surrogate pair (codepoints u'\ud83d'
and u'\ude01'
) if sys.maxunicode
results to 0xFFFF
:
PS D:\PShell> [System.Console]::OutputEncoding = [System.Text.Encoding]::UTF8
PS D:\PShell> . py -3 D:\test\Python\Py\42783173.py
3.5.1 maxunicode 0x10ffff
3 ♊ 😁 '♊ 😁'
1 ♊ '♊' So GEMINI
1 ' ' Zs SPACE
1 😁 '😁' So GRINNING FACE WITH SMILING EYES
PS D:\PShell> . py -2 D:\test\Python\Py\42783173.py
2.7.12 maxunicode 0xffff
4 ♊ 😁 u'\u264a \U0001f601'
1 ♊ u'\u264a' So GEMINI
1 u' ' Zs SPACE
1 �� u'\ud83d' Cs private use
1 �� u'\ude01' Cs private use
Note: above output examples were taken from Unicode-aware Powershell-ISE console pane.
Upvotes: 1