Sifu

Reputation: 163

Need help working with characters that are 2 or more bytes long in Python

I'm learning about bits and bytes in Python by writing a small program that converts strings to binary and back to strings again. For now I only have a function that converts to binary.

string = 'word'

for c in string:
    convertToBinary(c) #Function that converts to binary

Output:

01110111
01101111
01110010
01100100

Now I want to write a fromBinary() function that converts from binary back to a string. However, I'm stuck on how to deal with characters that are longer than 1 byte, such as 'å'.

string = 'å'

for c in string:
    convertToBinary(c)

Output:

11000011
10100101

This becomes a problem when I have a string that includes characters of different lengths (in bytes).

string = 'åw'

for c in string:
    convertToBinary(c)

Output:

11000011    #first byte of 'å'
10100101    #second byte of 'å'
01110111    #w

I was thinking that I could join the bytes back together as one, but I'm really puzzled about how to determine which bytes to join. How can I write a function that recognizes which bytes together form a character?

Upvotes: 3

Views: 348

Answers (1)

jcoppens

Reputation: 5440

It's not that hard. Of course there's a system to it - else no program could print or edit names like Ñáñez...

The upper bits of each byte tell you the role of that byte:

1) if bit 7 is 0, then it's just ASCII (*0*1110111 = w)

2) if you find 11 at the top, that means more byte(s) follow, and the number of leading 1s tells you how many bytes the whole sequence has. Every following byte starts with 10:

   *110*xxxxx *10*xxxxxx
   *1110*xxxx *10*xxxxxx *10*xxxxxx
   *11110*xxx *10*xxxxxx *10*xxxxxx *10*xxxxxx
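   *111110*xx *10*xxxxxx *10*xxxxxx *10*xxxxxx *10*xxxxxx
   *1111110*x *10*xxxxxx *10*xxxxxx *10*xxxxxx *10*xxxxxx *10*xxxxxx

(The 5- and 6-byte forms come from the original UTF-8 design. RFC 3629 limits UTF-8 to at most 4 bytes, so you will not see them in valid data today.)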
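
The patterns above can be sketched in Python roughly like this. It is only a minimal sketch, assuming Python 3 and a `bytes` object that holds valid UTF-8; `sequence_length` and `split_characters` are made-up names for illustration, and only the 1- to 4-byte forms are handled:

```python
def sequence_length(lead_byte):
    """Return how many bytes the UTF-8 sequence starting at lead_byte spans.

    Hypothetical helper: looks only at the top bits of the first byte.
    """
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: plain ASCII, 1 byte
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: 2-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: 3-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character
    raise ValueError("not a valid lead byte: {:08b}".format(lead_byte))


def split_characters(data):
    """Group a bytes object into one chunk of bytes per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters('åw'.encode('utf-8')))
```

Each chunk in the result holds the bytes that belong to a single character, which is the grouping the question asks about.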

11000011    #first byte of 'å'
10100101    #second byte of 'å'

Thus:

*110* means 1 byte follows:
*110*00011 *10*100101

Drop the marker bits and concatenate what is left: 00011 followed by 100101 gives 000 11100101, which is the Unicode code point for å (U+00E5).
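
The same bit arithmetic could be turned into a fromBinary-style function along these lines. This is a rough sketch, assuming Python 3 and that the input is a list of 8-character bit strings like the output shown in the question; `from_binary` is a hypothetical name, and the sketch does not check that continuation bytes really start with 10 or that the input is complete:

```python
def from_binary(bit_strings):
    """Decode a list of 8-character bit strings (UTF-8 bytes) into a str."""
    values = [int(bits, 2) for bits in bit_strings]
    chars = []
    i = 0
    while i < len(values):
        lead = values[i]
        if lead < 0b10000000:            # 0xxxxxxx: ASCII, no extra bytes
            extra, code_point = 0, lead
        elif lead >> 5 == 0b110:         # 110xxxxx: keep the low 5 bits
            extra, code_point = 1, lead & 0b00011111
        elif lead >> 4 == 0b1110:        # 1110xxxx: keep the low 4 bits
            extra, code_point = 2, lead & 0b00001111
        elif lead >> 3 == 0b11110:       # 11110xxx: keep the low 3 bits
            extra, code_point = 3, lead & 0b00000111
        else:
            raise ValueError("unexpected lead byte: " + bit_strings[i])
        for cont in values[i + 1:i + 1 + extra]:
            # each continuation byte (10xxxxxx) contributes its low 6 bits
            code_point = (code_point << 6) | (cont & 0b00111111)
        chars.append(chr(code_point))
        i += 1 + extra
    return "".join(chars)


print(from_binary(['11000011', '10100101', '01110111']))
```

Outside of a learning exercise, the built-in decoder does the same job with full validation: `bytes(int(b, 2) for b in bit_strings).decode('utf-8')`.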

Note: I believe there's a problem with your w in your example.

Upvotes: 1
