Reputation: 74104
Given a UTF-8 encoded string like this:
bar = "hello 。◕‿‿◕。"
and a byte offset that tells me at which byte I have to split the string:
bytes_offset = 9
how can I split the bar string into two parts, resulting in:
>>first_part
'hello 。' <---- #9 bytes 'hello \xef\xbd\xa1'
>>second_part
'◕‿‿◕。'
In a nutshell:
given a byte offset, how can I convert it into the corresponding character index of a UTF-8 encoded string?
Upvotes: 2
Views: 2423
Reputation: 961
The character offset is the number of characters that precede the byte offset:
def byte_to_char_offset(b_string, b_offset, encoding='utf8'):
    # Decode only the bytes before the offset and count the resulting characters.
    return len(b_string[:b_offset].decode(encoding))
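As a rough usage sketch, the function can be applied to the string from the question to get the two parts. The names text, data and char_offset are only illustrative, and the u"" literal is an assumption made so that the same lines run on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-

def byte_to_char_offset(b_string, b_offset, encoding='utf8'):
    # Count the characters that precede the given byte offset.
    return len(b_string[:b_offset].decode(encoding))

text = u"hello 。◕‿‿◕。"       # the string as characters
data = text.encode("utf-8")    # the same string as UTF-8 bytes

char_offset = byte_to_char_offset(data, 9)

# Split the character string at the converted offset.
first_part = text[:char_offset]
second_part = text[char_offset:]
```

Note that with the default strict error handling, decode raises UnicodeDecodeError when the byte offset falls in the middle of a multibyte character, so this only works for offsets that sit on a character boundary.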
Upvotes: 0
Reputation: 19037
In Python 2.x, a UTF-8 encoded str is just a byte string, so it can be sliced directly at the byte offset.
# -*- coding: utf-8 -*-
bar = "hello 。◕‿‿◕。"
assert(isinstance(bar, str))
first_part = bar[:9]
second_part = bar[9:]
print first_part
print second_part
Yields:
hello 。
◕‿‿◕。
This is Python 2.6 on OS X, but I expect the same from 2.7. If I split at 10 or 11 instead of 9, the output contains ? characters, which means the split landed in the middle of a multibyte character sequence; splitting at 12 moves the first "eyeball" into the first part of the string.
I have PYTHONIOENCODING set to utf8 in the terminal.
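To check beforehand whether an offset would cut a character in half, one option is to look at the byte at that offset: in UTF-8, continuation bytes always have the bit pattern 10xxxxxx. The helper below is a hypothetical sketch (is_char_boundary is not a built-in), written so it runs on Python 2 and 3:

```python
# -*- coding: utf-8 -*-

def is_char_boundary(data, offset):
    """Return True if splitting `data` at `offset` does not cut a UTF-8 character."""
    if offset <= 0 or offset >= len(data):
        return True
    # bytearray gives integer byte values on both Python 2 and Python 3.
    # A byte of the form 10xxxxxx continues a multibyte sequence.
    return (bytearray(data)[offset] & 0xC0) != 0x80

data = u"hello 。◕‿‿◕。".encode("utf-8")

# Keep only the offsets that are safe to split on.
safe_offsets = [i for i in (9, 10, 11, 12) if is_char_boundary(data, i)]
```

For the offsets tried above, this keeps 9 and 12 and rejects 10 and 11, which matches the broken output seen when splitting there.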
Upvotes: 3