Reputation: 3273
I noticed that the difference between lower and uppercase is 32
. This seems like a perfect opportunity to utilize some clever bit manipulation. The problem is that it's been a long time since my Computer Architecture classes, and I'm a bit rusty with the concepts required. From what I do recall, depending on the CPU architecture and language representation of signed/unsigned there are a very small number of solutions that would apply to almost all programming languages with these operators. I'm interested in comparing these. I'm not interested in simply converting the case, I know there are "simpler" ways (for humans, at least). I'm interested in learning about how this problem interacts with the low-level representation of the data.
Please provide workable, minimal solutions for both lowercase->upper and uppercase->lower, for each common representation, as well as a reasonably detailed explanation of how they work.
Upvotes: 1
Views: 562
Reputation: 12158
(Note: i'm using python
here, but this is of course language agnostic. I also do speak about ascii
, so I'll use a 7-bit representation of things.)
If you look at the binary representation of ascii
characters in the [a-z][A-Z]
range, you'll notice two things:
>>> bin(ord('a'))
'0b1100001'
>>> bin(ord('A'))
'0b1000001'
>>> bin(ord('y'))
'0b1111001'
>>> bin(ord('Y'))
'0b1011001'
First: they all have the seventh bit (from right) set. Second: the lowercase characters have the sixth bit (from right) set, the uppercase ones unset, and that's the only difference between a given uppercase character and it's lowercase version (et vice versa).
So all you have to do is flip that bit to toggle case - that would be xor 0b0100000
which is xor 0x20
.
To lower()
, you have to set that bit, so you can or 0b0100000
which is or 0x20
- the already mentioned or 0x60
also works, as 0x60 is 0b1100000 and that bit is set anyway.
And to upper, you have to unset that bit, that would be "and the inverse of the mask 0b0100000", which is the same as and 0x5f
.
To see it all in action, i wrote some python snipplets which check that what we just saw is true for every character in the english alphabet:
#toggle():
>>> ''.join(chr(ord(c)^0x20) for c in string.ascii_lowercase) == string.ascii_uppercase
True
>>> ''.join(chr(ord(c)^0x20) for c in string.ascii_uppercase) == string.ascii_lowercase
True
#lower():
>>> ''.join(chr(ord(c)|0x20) for c in string.ascii_lowercase) == string.ascii_lowercase
True
>>> ''.join(chr(ord(c)|0x20) for c in string.ascii_uppercase) == string.ascii_lowercase
#upper():
>>> ''.join(chr(ord(c)&0x5f) for c in string.ascii_lowercase) == string.ascii_uppercase
True
>>> ''.join(chr(ord(c)&0x5f) for c in string.ascii_uppercase) == string.ascii_uppercase
True
it does not do anything useful to ' '
, '\n'
, and such though!
Upvotes: 1