How can I access a string by byte index?

Question

I have a string containing multilingual letters and special characters.

str1 = "가나42hello world 門()&*&#  [1]"

In the above string, "가", "나", and "門" take 2 bytes each, and the rest take 1 byte each.

Under these circumstances, is there any way to get the character 'h', which corresponds to the 7th byte of that string (not str1[7], which is 'l')?

In other words, can I do random access using a byte index?

I use Python.

Koterpillar · Accepted Answer

Strings in Python (assuming Python 3) are sequences of Unicode characters, so e.g. 한 counts as one character no matter how many bytes it takes to store.

They can be represented as bytes using different encodings, each of which represents a character using one or more bytes. Not all encodings can represent all characters, and not all encodings use the same number of bytes per character.
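As a small illustration of that point (the sample character and the three encodings are my own picks, not from the question), the same character can take a different number of bytes depending on the encoding, and some encodings cannot represent it at all:

```python
# Byte length of one Hangul character under a few encodings.
ch = "가"
sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "euc-kr")}
print(sizes)

# ASCII has no representation for Hangul, so encoding fails.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

This may explain the 2-byte figure in the question: it matches an encoding such as EUC-KR rather than UTF-8.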

Assuming UTF-8, let's encode the string and inspect bytes:

s = "가나42hello world 門()&*&# [1]"
b = s.encode("utf-8")
print(b[8])
# This prints 104, the byte value of 'h' (indexing a bytes object yields an int)
print(chr(b[8]))
# This prints 'h'
print(b[0:3].decode("utf-8"))
# This prints '가'

Note that in UTF-8 each Hangul character (and 門) takes 3 bytes, not 2, so I've adjusted the indices accordingly. To use the default encoding, omit the argument to encode and decode; in Python 3 that default is UTF-8. To check which encoding is in use, call sys.getdefaultencoding().
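If the byte index might land in the middle of a multi-byte character, indexing the encoded bytes directly gives only a fragment of it. One possible way around that is sketched below. Note that char_at_byte is a hypothetical helper name, not a built-in, and the sketch assumes a stateless encoding that adds no byte-order mark (so "utf-8" or "euc-kr" work, but plain "utf-16" would not):

```python
def char_at_byte(s, byte_index, encoding="utf-8"):
    """Return the character of s whose encoded bytes cover byte_index.

    Hypothetical helper: walks the string once, so it is O(n), and it
    assumes each character can be encoded on its own (no BOM, no state).
    """
    offset = 0
    for ch in s:
        width = len(ch.encode(encoding))
        if offset <= byte_index < offset + width:
            return ch
        offset += width
    raise IndexError("byte index out of range")


s = "가나42hello world 門()&*&# [1]"
print(char_at_byte(s, 8))            # byte 8 in UTF-8
print(char_at_byte(s, 1))            # byte 1 is inside the first character
print(char_at_byte(s, 6, "euc-kr"))  # byte 6 when Hangul takes 2 bytes
```

For repeated lookups on a long string, it would likely be better to build the list of byte offsets once and search that, rather than rescanning each time.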
