Reputation: 1407
I'm writing a Python script to read Unicode characters from a file and insert them into a database. I can only insert 30 bytes of each string. How do I calculate the size of the string in bytes before I insert into the database?
Upvotes: 0
Views: 1473
Reputation: 4037
Suppose you read the file's contents into a variable called byteString. You can then decode it and count the Unicode characters it contains (note that this gives the number of characters, not the number of bytes):
unicode_string = byteString.decode("utf-8")
print len(unicode_string)
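Since the question asks for the size in bytes, the number to check is the length of the encoded string rather than the decoded one. A minimal sketch, assuming Python 3 and that the database stores the text as UTF-8 (the helper name byte_length is made up for illustration):

```python
def byte_length(text, encoding="utf-8"):
    """Return how many bytes `text` occupies once encoded.

    Assumption: the database column stores text in `encoding`
    (UTF-8 here); a different encoding gives a different count.
    """
    return len(text.encode(encoding))


sample = "caf\u00e9 \u20ac"      # 'café €'
print(len(sample))               # number of Unicode characters
print(byte_length(sample))       # number of bytes in UTF-8
```

The two numbers differ whenever the text contains non-ASCII characters, so compare byte_length(...) against the 30-byte limit, not len(...).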
Upvotes: 0
Reputation: 414079
If you need the byte count of the whole file (the file size), just call bytes_count = os.path.getsize(filename).
If you want to find out how many bytes a Unicode character requires, that depends on the character encoding:
>>> print(u"\N{EURO SIGN}")
€
>>> u"\N{EURO SIGN}".encode('utf-8') # 3 bytes
'\xe2\x82\xac'
>>> u"\N{EURO SIGN}".encode('cp1252') # 1 byte
'\x80'
>>> u"\N{EURO SIGN}".encode('utf-16le') # 2 bytes
'\xac '
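For the 30-byte limit in the question, one possible approach is to encode the string, cut the bytes at the limit, and decode again while dropping any character that was split by the cut. This is only a sketch, assuming Python 3 and UTF-8 storage; truncate_to_bytes is a hypothetical helper, not a standard library function:

```python
def truncate_to_bytes(text, max_bytes=30):
    """Return the longest prefix of `text` that fits in `max_bytes` of UTF-8.

    Assumption: the database limit applies to the UTF-8 encoded form.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The slice may end in the middle of a multi-byte character;
    # errors="ignore" drops that incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


euros = "\u20ac" * 11                      # 11 euro signs, 33 bytes in UTF-8
short = truncate_to_bytes(euros)
print(len(short), len(short.encode("utf-8")))
```

Because the input bytes come from a valid encoding, only the tail of the slice can be incomplete, so nothing else is discarded. Note that this cuts on code point boundaries, so a base letter can still be separated from a following combining mark.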
To find out how many Unicode characters a file contains, you don't need to read the whole file in memory at once (in case it is a large file):
with open(filename, encoding=character_encoding) as file:
    unicode_character_count = sum(len(line) for line in file)
If you are on Python 2, add from io import open at the top.
The exact count for the same human-readable text may depend on Unicode normalization (different environments may use different settings):
>>> import unicodedata
>>> print(u"\u212b")
Å
>>> unicodedata.normalize("NFD", u"\u212b") # 2 Unicode codepoints
u'A\u030a'
>>> unicodedata.normalize("NFC", u"\u212b") # 1 Unicode codepoint
u'\xc5'
>>> unicodedata.normalize("NFKD", u"\u212b") # 2 Unicode codepoints
u'A\u030a'
>>> unicodedata.normalize("NFKC", u"\u212b") # 1 Unicode codepoint
u'\xc5'
As the example shows, a single character (Å) may be represented using several Unicode codepoints.
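Normalization also changes the byte count, which matters when there is a fixed byte limit. A small sketch, assuming Python 3 and UTF-8 (the helper name utf8_sizes is made up for illustration):

```python
import unicodedata


def utf8_sizes(text):
    """Map each normalization form to (codepoint count, UTF-8 byte count)."""
    sizes = {}
    for form in ("NFC", "NFD"):
        normalized = unicodedata.normalize(form, text)
        sizes[form] = (len(normalized), len(normalized.encode("utf-8")))
    return sizes


# U+212B ANGSTROM SIGN, the same character used in the example above
for form, (codepoints, nbytes) in utf8_sizes("\u212b").items():
    print(form, codepoints, nbytes)
```

If the same text can arrive in different forms, normalizing to one form (for example NFC) before measuring keeps the byte counts consistent.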
To find out how many user-perceived characters a file contains, you could use the \X regular expression (it matches eXtended grapheme clusters):
import regex # $ pip install regex
with open(filename, encoding=character_encoding) as file:
    character_count = sum(len(regex.findall(r'\X', line)) for line in file)
Example:
>>> import regex
>>> char = u'A\u030a'
>>> print(char)
Å
>>> len(char)
2
>>> regex.findall(r'\X', char)
['Å']
>>> len(regex.findall(r'\X', char))
1
Upvotes: 5