강태천

Reputation: 3

How can I access a string by byte index?

I have a string containing multilingual letters and special characters.

str1 = "가나42hello world 門()&*&#  [1]"

In the above string, "가", "나", and "門" are 2 bytes each, and the rest are 1 byte each.

Under these circumstances, is there any way to get the character 'h', which corresponds to the 7th byte of that string (not str1[7], which is 'l')?

I mean, can I do random access using a byte index?

I use Python.

Upvotes: 0

Views: 256

Answers (2)

Jeremy McGibbon

Reputation: 3785

This depends on your encoding (when I encode that string as UTF-8, the Hangul and CJK characters take 3 bytes each instead of 2), but in general you can do this by converting to bytes, performing your selection, and then converting back. For example, the following will print 'h':

s = "가나42hello world 門()&*&# [1]"
b = bytes(s, encoding="utf-8")
selection = b[8:9].decode("utf-8")
print(selection)

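It is important to select from b with a slice rather than a single index (e.g. 8:9 for the byte at index 8). Indexing a bytes object with a single integer returns an int, which has no decode method, whereas a slice returns a bytes object that can be decoded.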
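To make the slice-versus-index difference concrete, here is a minimal sketch using the same string. The variable names single and sliced are only illustrative.

```python
s = "가나42hello world 門()&*&# [1]"
b = s.encode("utf-8")

single = b[8]    # indexing a bytes object yields an int (the byte value)
sliced = b[8:9]  # slicing yields a bytes object of length 1

print(type(single).__name__, single)  # int 104
print(type(sliced).__name__, sliced)  # bytes b'h'
print(sliced.decode("utf-8"))         # h
```

Calling single.decode("utf-8") would fail with an AttributeError, because int has no decode method.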

Upvotes: 0

Koterpillar

Reputation: 8104

Strings in Python (assuming Python 3) are sequences of characters (where e.g. 한 is one character).

They can be represented in memory using different encodings, each of which represents a character using one or more bytes. Not all encodings can represent all characters, and not all encodings use the same number of bytes per character.
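As a rough illustration of how the byte count per character varies, the sketch below encodes two sample characters with three encodings. The choice of encodings is my own; in particular, EUC-KR is only a guess at where a 2-byte count for Hangul might come from, not something stated in the question.

```python
samples = ["가", "h"]

# Map each encoding name to the encoded size of every sample character.
sizes_by_encoding = {}
for encoding in ("utf-8", "utf-16-le", "euc-kr"):
    sizes_by_encoding[encoding] = {
        ch: len(ch.encode(encoding)) for ch in samples
    }

for encoding, sizes in sizes_by_encoding.items():
    print(encoding, sizes)
# utf-8 {'가': 3, 'h': 1}
# utf-16-le {'가': 2, 'h': 2}
# euc-kr {'가': 2, 'h': 1}
```

So the byte offset of a given character depends entirely on which encoding produced the bytes.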

Assuming UTF-8, let's encode the string and inspect bytes:

s = "가나42hello world 門()&*&# [1]"
b = s.encode("utf-8")
print(b[8])
# This prints 104, the UTF-8 code for 'h'
print(chr(b[8]))
# This prints 'h'
print(b[0:3].decode("utf-8"))
# This prints '가'

Note that in UTF-8, each Hangul character takes 3 bytes, not 2, so I've adjusted the indices. If you want the default encoding (UTF-8 in Python 3), omit the encoding argument to encode and decode. If you want to find out which default encoding you are using, check sys.getdefaultencoding().
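One thing to watch for: a byte index can land in the middle of a multi-byte character, in which case decoding that single byte raises UnicodeDecodeError. A possible workaround is sketched below. The helper char_at_byte is hypothetical (not a standard library function), and it assumes an encoding that adds no byte order mark and keeps no shift state, such as UTF-8.

```python
def char_at_byte(s, byte_index, encoding="utf-8"):
    """Return the character of s whose encoded bytes cover byte_index.

    Hypothetical helper: walks the string once, so it is O(len(s)),
    and assumes each character can be encoded independently.
    """
    if byte_index < 0:
        raise IndexError("byte_index must be non-negative")
    offset = 0
    for ch in s:
        # Advance past this character's bytes; if the target index
        # falls before the new offset, this character covers it.
        offset += len(ch.encode(encoding))
        if byte_index < offset:
            return ch
    raise IndexError("byte_index out of range")


s = "가나42hello world 門()&*&# [1]"
print(char_at_byte(s, 8))  # h
print(char_at_byte(s, 1))  # 가 (byte 1 is inside the 3-byte sequence)
```

This trades true random access for safety: it never splits a character, but it has to scan from the start of the string.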

Upvotes: 2
