Alice Sigane

Reputation: 19

How to encode a string into a bytearray using UTF-8?

I want to encode a string into a byte array using UTF-8. For example, for the string "CD" I want to obtain b"\x43\x44". I have tried this, but it's not working:

def toTab(strMessage):
    return strMessage.encode('utf-8')

and I get b'CD', which is not the result I want.

Upvotes: 2

Views: 1454

Answers (2)

jacob

Reputation: 1097

One of the major changes from Python 2 to 3 was to the str data type: text (str) and binary data (bytes) are now separate types. When Python displays a bytes object, it tries hard to keep it human readable, showing every byte that corresponds to a printable ASCII character as that character, which can lead to some interesting and frustrating things when you expect to see hex values. The b in front of the literal marks it as a bytes object rather than a str, so your function actually is working; the result is just displayed in human-readable form. To show this, simply try:

b'CD'.hex()

or, more specifically:

'CD'.encode().hex()

which gives:

'4344'
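As a small sketch of the same point, iterating over the encoded value shows the underlying integers, independent of how the console chooses to print them. The variable names here are only illustrative.

```python
# Encode the string and inspect the raw byte values.
encoded = 'CD'.encode('utf-8')

# Iterating over a bytes object yields integers in range(256).
values = list(encoded)

# Format each integer as a two-digit hex string.
hex_values = [format(b, '02x') for b in encoded]

print(values)      # [67, 68]
print(hex_values)  # ['43', '44']
```

Here 67 and 68 are simply 0x43 and 0x44 written in decimal.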

EDIT: To clarify, when Python displays a bytes object it will always show printable ASCII bytes as ASCII characters. This can be shown by entering the following into a console:

"résumé".encode("utf-8")

which will yield:

b'r\xc3\xa9sum\xc3\xa9'

Notice that every ASCII character is rendered as itself, while non-ASCII characters are shown as escaped bytes. Also notice a key point: a character encoded in UTF-8 takes anything from 1 to 4 bytes (where a byte is 8 bits). The entire ASCII set, on the other hand, needs only 7 bits, so every ASCII byte has its high bit set to zero.
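A quick way to check the 1 to 4 byte claim is to measure a few encoded characters. This is only a sketch; the sample characters are my own choice, written as escapes so the snippet does not depend on the file's encoding.

```python
# Characters chosen to need 1, 2, 3, and 4 bytes in UTF-8:
# 'A', e-acute, the euro sign, and an emoji.
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

# Map each character to the length of its UTF-8 encoding.
lengths = {ch: len(ch.encode('utf-8')) for ch in samples}
print(list(lengths.values()))  # [1, 2, 3, 4]

# Every byte of pure-ASCII text is below 0x80, i.e. the high bit is clear.
ascii_high_bit_clear = all(b < 0x80 for b in 'CD'.encode('utf-8'))
print(ascii_high_bit_clear)  # True
```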

So again, your output is b'\x43\x44'; it is just visually represented as b'CD'. If you passed this to a C program to, say, exploit a buffer overflow, it would receive the bytes 0x43 0x44, as you desire.

To show this, try:

if b'\x43\x44' == b'CD':
    print(True, b'\x43\x44')
else:
    print(False)

Which will print: True b'CD'
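Since the question mentions a byte array specifically, here is a minimal sketch of the mutable counterpart, assuming a bytearray (rather than an immutable bytes object) is what is needed. The name data is illustrative.

```python
# bytearray accepts a string plus an encoding and returns a mutable buffer.
data = bytearray('CD', 'utf-8')

# It compares equal to the bytes value with the same contents.
same_contents = (data == b'\x43\x44')
print(same_contents)  # True

# Unlike bytes, individual elements can be reassigned in place.
data[0] = 0x41
print(data)  # bytearray(b'AD')
```

The printed form still uses ASCII characters where it can, for the same reason as with bytes.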

Upvotes: 2

martineau

Reputation: 123481

You could get what you want by combining and formatting each byte of the bytearray manually.

def toTab(strMessage):
    # Format each byte as a two-digit hex escape, so that 0x05 becomes \x05 rather than \x5.
    return 'b"{}"'.format(''.join(r'\x{:02x}'.format(b) for b in strMessage))

msg = b"\x43\x44"
print(toTab(msg))  # -> b"\x43\x44"
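The function above takes a bytes value. If the starting point is a str, as in the question, one possible variant encodes first and then formats each byte. The name to_escaped and its encoding parameter are hypothetical, not part of the original answer.

```python
def to_escaped(message, encoding='utf-8'):
    """Return a printable b"..." literal with every byte shown as \\xNN."""
    raw = message.encode(encoding)
    # Two-digit, zero-padded hex keeps small values such as 0x0a well formed.
    escaped = ''.join('\\x{:02x}'.format(b) for b in raw)
    return 'b"{}"'.format(escaped)

print(to_escaped('CD'))  # b"\x43\x44"
```

Note that the result is a str meant for display only; the actual data to send or store is still message.encode('utf-8').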

Upvotes: 1
