Reputation: 1857
I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits the length of the key by bytes, not characters.
From the docs:
The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.
I also want to embed metadata in the key name, so I need to be able to calculate a string's byte length in Python to make sure the metadata does not push the key over the limit (in which case I would fall back to a separate metadata file).
How can I determine the byte length of a UTF-8-encoded string? Again, I am not interested in the character length, but rather the actual number of bytes used to store the string.
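To make the goal concrete, here is roughly the check I am trying to write (the names are placeholders; the marked line is the part I don't know how to do):
MAX_S3_KEY_BYTES = 1024  # limit from the S3 docs quoted above

def key_fits(base_name, metadata_suffix):
    candidate = base_name + metadata_suffix
    # len() counts characters, not bytes -- this is the part I need to fix
    return len(candidate) <= MAX_S3_KEY_BYTES

print(key_fits(u"reports/caf\u00e9", u"-2013.pdf"))  # True, but for the wrong reason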
Upvotes: 28
Views: 37946
Reputation: 95242
Use the string's encode() method to convert from a character string to a byte string, then use len() as normal:
>>> s = u"¡Hola, mundo!"
>>> len(s)
13 # characters
>>> len(s.encode('utf-8'))
14 # bytes
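(The extra byte comes from '¡', which is code point U+00A1 and encodes to two bytes in UTF-8; the other twelve characters are ASCII and take one byte each.)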
Upvotes: 13
Reputation: 308111
Encoding the string and using len() on the result works great, as other answers have shown. It does need to build a throwaway copy of the string, so if you're working with very large strings this might not be optimal (though I don't consider 1024 bytes large). The structure of UTF-8 lets you compute each character's encoded length very easily without even encoding it, although it might still be easier to encode a single character. I present both methods here; they should give the same result.
def utf8_char_len_1(c):
    # Read the encoded length straight off the code point,
    # using the standard UTF-8 range boundaries.
    codepoint = ord(c)
    if codepoint <= 0x7f:
        return 1
    if codepoint <= 0x7ff:
        return 2
    if codepoint <= 0xffff:
        return 3
    if codepoint <= 0x10ffff:
        return 4
    raise ValueError('Invalid Unicode character: ' + hex(codepoint))

def utf8_char_len_2(c):
    # Encode just this one character and measure the result.
    return len(c.encode('utf-8'))

utf8_char_len = utf8_char_len_1

def utf8len(s):
    return sum(utf8_char_len(c) for c in s)
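For example, both per-character versions agree on the string from the earlier answer:
>>> s = u"¡Hola, mundo!"
>>> sum(utf8_char_len_1(c) for c in s)
14
>>> sum(utf8_char_len_2(c) for c in s)
14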
Upvotes: 8
Reputation: 213268
def utf8len(s):
    return len(s.encode('utf-8'))
Works fine in Python 2 and 3.
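One caveat on the Python 2 side: this expects a unicode object. Calling it on a byte str that contains non-ASCII bytes raises UnicodeDecodeError, because the str is implicitly decoded as ASCII before being re-encoded. A quick check:
>>> utf8len(u"¡Hola, mundo!")   # Python 3 str, or Python 2 unicode
14
>>> utf8len("plain ascii key")  # ASCII-only str works on both versions
15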
Upvotes: 43