Matthew Plascencia
Matthew Plascencia

Reputation: 69

Correctly embedding and extracting Unicode data from an image

I recently asked a question about embedding data into an image. I promptly solved that issue with help from other forums. I have run into a new problem: my program works fine for all Latin characters and even ones with diacritical (accent) marks. My program fails for other unicode characters, such as thse from scripts like Cyrillic, Greek, and Arabic.

Here is the code for my current program that cannot do unicode.

def embed_text(self, image_path, text, output_path):
    # Convert text message to binary format
    binary_message = ''.join(format(ord(char), '08b') for char in text) #aligning the bytes for embedding the message 
    # Load the image
    image = Image.open(image_path)
    w, h = image.size

    # Calculate the number of embedding characters (eN)
    eN = (h * w * 3) // 8
    if len(text) > eN:
        raise ValueError("Message too long to fit in the image")

    # Embedding loop
    message_index = 0
    for i in range(h):
        for j in range(w):
            pixel = list(image.getpixel((j, i)))

            for k in range(3):  # For R, G, B components
                if message_index < len(binary_message):
                    M = int(binary_message[message_index])
                    # Perform XOR operation with the 7th bit of the RGB component
                    pixel[k] = (pixel[k] & 0xFE) | (((pixel[k] >> 1) & 1) ^ M)
                    message_index += 1
                else:
                    break  # No more message bits to embed

            image.putpixel((j, i), tuple(pixel))

    # Save the Stego Image
    image.save(output_path)

def xor_substitution(self, component, bit):
    # Perform XOR on the least significant bit of the component with the bit
    return (component & 0xFE) | (component & 1) ^ bit

def extract_text(self, image_path):
    stego_image = Image.open(image_path)
    w, h = stego_image.size
    binary_message = ""

    for i in range(h):
        for j in range(w):
            pixel = stego_image.getpixel((j, i))
            for k in range(3):
                binary_message += str(pixel[k] & 1)

    # Extract only up to the NULL character
    end = binary_message.find('00000000')
    if end != -1:
        binary_message = binary_message[:end]

    return self.binary_to_string(binary_message)

def binary_to_string(self, binary_message):
    text = ""
    for i in range(0, len(binary_message), 8):
        byte = binary_message[i:i+8]
        text += chr(int(byte, 2))
    return text

As I said, these methods do well at embedding and extracting Latin text from the images I throw at it. When I try to embed things like

το

قي

8

I get things like

ñ;

ÈY♣

[]

In attempts to fix this issue, I have changed the 8-bit values in the lines that denote '08b' to '16b'. I found that the program still manages to embed things into images but takes out Japanese Kanji or Chinese characters.

Heere is the code I changed:

def embed_text(self, image_path, text, output_path):
    # Convert text message to binary format
    binary_message = ''.join(format(ord(char), '16b') for char in text)

    # Load the image
    image = Image.open(image_path)
    w, h = image.size

    # Calculate the number of embedding characters (eN)
    eN = (h * w * 3) // 16
    if len(text) > eN:
        raise ValueError("Message too long to fit in the image")
    binary_message = binary_message.ljust(eN * 16, '0')


    # Embedding loop
    message_index = 0
    for i in range(h):
        for j in range(w):
            pixel = list(image.getpixel((j, i)))

            for k in range(3):  # For R, G, B components
                if message_index < len(binary_message):
                    M = int(binary_message[message_index:message_index+16], 2)
                    # Perform XOR operation with the 7th bit of the RGB component
                    pixel[k] = (pixel[k] & 0xFFFE) | (((pixel[k] >> 1) & 1) ^ M)
                    message_index += 16
                else:
                    break  # No more message bits to embed

            image.putpixel((j, i), tuple(pixel))

    # Save the Stego Image
    image.save(output_path)

def xor_substitution(self, component, bit):
    # Perform XOR on the least significant bit of the component with the bit
    return (component & 0xFE) | (component & 1) ^ bit

def extract_text(self, image_path):
    stego_image = Image.open(image_path)
    w, h = stego_image.size
    binary_message = ""

    for i in range(h):
        for j in range(w):
            pixel = stego_image.getpixel((j, i))
            for k in range(3):
                binary_message += format(pixel[k] & 1, 'b').zfill(16)[-1]

    # Extract only up to the NULL character
    end = binary_message.find('00000000')
    if end != -1:
        binary_message = binary_message[:end]

    return self.binary_to_string(binary_message)

def binary_to_string(self, binary_message):
    text = ""
    for i in range(0, len(binary_message), 16):
        byte = binary_message[i:i+16]
        text += chr(int(byte, 2))
    return text

I'd like to know how I can fix these issues my program has since I need to have this implementation finished by the 5th of December. Thanks in advance for your assistance!

Upvotes: 0

Views: 216

Answers (1)

Mark Tolonen
Mark Tolonen

Reputation: 177971

Minimal changes to original code to make it work, with comments:

# Added missing import
from PIL import Image

# removed "self" from arguments.
def embed_text(image_path, text, output_path):
    # encode the message in UTF-8 and add a null.
    message = text.encode() + b'\x00'
    binary_message = ''.join(format(byte, '08b') for byte in message)
    image = Image.open(image_path)
    w, h = image.size

    # Calculate the number of embedding characters (eN)
    eN = (h * w * 3) // 8
    # Check that encoded message has enough space in image
    if len(message) > eN:
        raise ValueError("Message too long to fit in the image")

    message_index = 0
    for i in range(h):
        for j in range(w):
            pixel = list(image.getpixel((j, i)))

            for k in range(3):  # For R, G, B components
                if message_index < len(binary_message):
                    M = int(binary_message[message_index])
                    # Set bit 0 of the RGB component to the message bit
                    pixel[k] = (pixel[k] & 0xFE) | M
                    message_index += 1
                else:
                    break

            image.putpixel((j, i), tuple(pixel))

    image.save(output_path)

def extract_text(image_path):
    stego_image = Image.open(image_path)
    w, h = stego_image.size
    binary_message = ""

    for i in range(h):
        for j in range(w):
            pixel = stego_image.getpixel((j, i))
            for k in range(3):
                binary_message += str(pixel[k] & 1)

    # The NULL find here could false locate 8 zero bits that are parts of two bytes so removed.

    # removed "self." from function call.
    return binary_to_string(binary_message)

# removed "self" parameter.
def binary_to_string(binary_message):
    # Use a byte array to store extracted bytes
    text = bytearray()
    for i in range(0, len(binary_message), 8):
        byte = binary_message[i:i+8]
        # stop on NULL byte
        if byte == '00000000': break
        # add extracted byte
        text.append(int(byte, 2))
    # decode the message
    return text.decode()

# testing.  Use appropriate input PNG file.
embed_text('in.png', 'Hello, world! 世界您好!', 'out.png')
print(extract_text('out.png'))

More efficient algorithm that removes int/str/int conversions and only processes input file up to the null:

from PIL import Image

def embed_text(image_path, text, output_path):
    # Convert text message to bytes format and add null
    message_bytes = text.encode() + b'\x00'

    # Load the image and extract the bytes into a mutable array
    image = Image.open(image_path)
    image_bytes = bytearray(image.tobytes())

    # Need 8 bytes of image to store 8 bits of a message byte
    if len(message_bytes) * 8 > len(image_bytes):
        raise ValueError("Message too long to fit in the image")

    # Embedding loop:
    # 1. Clear the image byte MSB.
    # 2. Compute the message_byte/bit indices.
    # 3. Compute the message bit and set the image byte LSB.
    for i in range(len(message_bytes) * 8):
        image_bytes[i] &= 0xFE
        message_index, bit_index = divmod(i, 8)
        image_bytes[i] |= (message_bytes[message_index] >> bit_index) & 1

    # Load the updated bytes into the image and save
    image.frombytes(image_bytes)
    image.save(output_path)

def extract_text(image_path):
    stego_image = Image.open(image_path)
    image_bytes = stego_image.tobytes()
    message_bytes = bytearray()
    message_byte = 0

    # Extraction loop:
    # 1. Extract LSB and store in message_byte at the correct index
    # 2. Once 8 bits are extracted (indices 0-7):
    #    IF message_byte is not null:
    #      append message byte to message and clear message byte
    #    ELSE stop processing image bytes.
    for i, byte in enumerate(image_bytes):
        bit_index = i % 8
        message_byte |= (byte & 1) << bit_index
        if bit_index == 7:
            if message_byte:  # if not null
                message_bytes.append(message_byte)
                message_byte = 0
            else:
                break
    try:
        return message_bytes.decode()
    except UnicodeDecodeError:
        return 'No valid message found.'

embed_text('in.png', 'Hello, world! 世界您好!', 'out.png')
print(extract_text('out.png'))

Upvotes: 0

Related Questions