John O C
John O C

Reputation: 103

XOR two strings of different length

So I am trying to XOR two strings together but am unsure if I am doing it correctly when the strings are different length. The method I am using is as follows.

def xor_two_str(a,b):
xored = []
for i in range(max(len(a), len(b))):
    xored_value = ord(a[i%len(a)]) ^ ord(b[i%len(b)])
    xored.append(hex(xored_value)[2:])
return ''.join(xored)

I get output like so.

abc XOR abc: 000
abc XOR ab: 002
ab XOR abc: 5a
space XOR space: 0

I know something is wrong and I will eventually want to convert the hex value to ascii so am worried the foundation is wrong. Any help would be greatly appreciated.

Upvotes: 2

Views: 3850

Answers (1)

ShadowRanger
ShadowRanger

Reputation: 155576

Your code looks mostly correct (assuming the goal is to reuse the shorter input by cycling back to the beginning), but your output has a minor problem: It's not fixed width per character, so you could get the same output from two pairs characters with a small (< 16) difference as from a single pair of characters with a large difference.

Assuming you're only working with "bytes-like" strings (all inputs have ordinal values below 256), you'll want to pad your hex output to a fixed width of two, with padding zeroes changing:

xored.append(hex(xored_value)[2:])

to:

xored.append('{:02x}'.format(xored_value))

which saves a temporary string (hex + slice makes the longer string then slices off the prefix, when format strings can directly produce the result without the prefix) and zero-pads to a width of two.

There are other improvements possible for more Pythonic/performant code, but that should be enough to make your code produce usable results.

Side-note: When running your original code, xor_two_str('abc', 'ab') and xor_two_str('ab', 'abc') both produced the same output, 002 (Try it online!), which is what you'd expect (since xor-ing is commutative, and you cycle the shorter input, reversing the arguments to any call should produce the same results). Not sure why you think it produced 5a. My fixed code (Try it online!) just makes the outputs 000000, 000002, 000002, and 00; padded properly, but otherwise unchanged from your results.

As far as other improvements to make, manually converting character by character, and manually cycling the shorter input via remainder-and-indexing is a surprisingly costly part of this code, relative to the actual work performed. You can do a few things to reduce this overhead, including:

  1. Convert from str to bytes once, up-front, in bulk (runs in roughly one seventh the time of the fastest character by character conversion)
  2. Determine up front which string is shortest, and use itertools.cycle to extend it as needed, and zip to directly iterate over paired byte values rather than indexing at all

Together, this gets you:

from itertools import cycle

def xor_two_str(a,b):
    # Convert to bytes so we iterate by ordinal, determine which is longer
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = []
    for x, y in zip(long, cycle(short)):
        xored_value = x ^ y
        xored.append('{:02x}'.format(xored_value))
    return ''.join(xored)

or to make it even more concise/fast, we just make the bytes object without converting to hex (and just for fun, use map+operator.xor to avoid the need for Python level loops entirely, pushing all the work to the C layer in the CPython reference interpreter), then convert to hex str in bulk with the (new in 3.5) bytes.hex method:

from itertools import cycle
from operator import xor

def xor_two_str(a,b):
    short, long = sorted((a.encode('latin-1'), b.encode('latin-1')), key=len)
    xored = bytes(map(xor, long, cycle(short)))
    return xored.hex()

Upvotes: 3

Related Questions