Reputation: 33
Can two different strings have the same byte sequence when encoded with different encodings? That is, are there some "string one" and "string two" in the example below that, when encoded using two different encodings (Cp1252 and UTF-8 are just examples), will cause the test to pass?
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import org.junit.Assert;
import org.junit.Test;
public class EncodingTest {
@Test
public void test() throws UnsupportedEncodingException {
final byte[] sequence1 = "string one".getBytes("Cp1252");
final byte[] sequence2 = "string two".getBytes("UTF-8");
Assert.assertTrue(Arrays.equals(sequence1, sequence2));
}
}
A bug in my code hashes the byte sequence generated from a String with the JVM's default encoding, and I need to verify whether that will cause hash collisions when the code is run with different strings and different JVM file encodings (which can happen when it is run on Windows and Linux, for example).
Since an encoding is a mapping between byte sequences and characters, I think there may be some strings and encodings that pass the above test. But I just wanted to know if there are any well-known examples, or some good reasons why I shouldn't be relying on hash collisions not happening.
Thanks
PS: This is only for encodings supported by JDK 1.6 and not by some made up ones.
Upvotes: 3
Views: 3099
Reputation: 201508
Yes. To take a simple example, the string “¡” (the inverted exclamation mark) encoded as ISO-8859-1 and the string “Ą” (capital A with ogonek) encoded as ISO-8859-2 both become the single-byte sequence A1 (hex). It is more or less obvious that such things happen when using the very simple encodings that map characters to single bytes; otherwise they would not be different encodings. It can surely happen when more complicated encoding schemes are involved, too.
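A minimal sketch verifying this example (the class name is my own; the characters are written as Unicode escapes to keep the source file encoding-independent):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class CollisionDemo {
    public static void main(String[] args) {
        // U+00A1 INVERTED EXCLAMATION MARK in ISO-8859-1
        byte[] a = "\u00A1".getBytes(Charset.forName("ISO-8859-1"));
        // U+0104 LATIN CAPITAL LETTER A WITH OGONEK in ISO-8859-2
        byte[] b = "\u0104".getBytes(Charset.forName("ISO-8859-2"));
        System.out.println(Arrays.equals(a, b));              // true
        System.out.println(Integer.toHexString(a[0] & 0xFF)); // a1
    }
}
```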
Upvotes: 2
Reputation:
Yes, it is possible, at least for strings of different lengths.
The string "\u2020" (or "†", the dagger) is encoded as 0x20 0x20 in UTF-16BE. This is also what "\u0020\u0020" (a string of two ASCII spaces) is encoded to in ASCII.
Of course, the dagger doesn't come up in language very often [=^_^=], but some standard non-Latin alphabets could generate similar byte sequences that map onto a standard (non-control-character) ASCII encoding, and many more if the restriction about control characters is relaxed.
It would be more interesting to find a case where two similar "realistic" strings (e.g. same length and "sensible data") could map onto the same byte sequence with different encodings.
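The dagger example above can be checked directly (the class name is my own; note that Java's "UTF-16" charset would prepend a BOM, so "UTF-16BE" is used here):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DaggerDemo {
    public static void main(String[] args) {
        // U+2020 DAGGER in UTF-16BE is the two bytes 0x20 0x20 (no BOM)
        byte[] dagger = "\u2020".getBytes(Charset.forName("UTF-16BE"));
        // Two ASCII spaces are also the two bytes 0x20 0x20
        byte[] spaces = "\u0020\u0020".getBytes(Charset.forName("US-ASCII"));
        System.out.println(Arrays.equals(dagger, spaces)); // true
    }
}
```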
Upvotes: 1
Reputation: 20909
This code should produce an example eventually:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.util.Arrays;
import java.util.Random;

Random r = new Random();
// The default CharsetDecoder reports malformed input instead of
// silently replacing it, unlike the String(byte[], String) constructor
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
while (true) {
    byte[] bytes = new byte[4];
    r.nextBytes(bytes);
    try {
        // throws CharacterCodingException if the bytes are not valid UTF-8
        String utf8 = utf8Decoder.decode(ByteBuffer.wrap(bytes)).toString();
        String latin1 = new String(bytes, "ISO-8859-1"); // every byte is valid Latin-1
        System.out.println(Arrays.toString(bytes) + " is " + utf8 + " or " + latin1);
        break;
    } catch (CharacterCodingException e) {
        // not valid UTF-8; try another random sequence
    }
}
Upvotes: 1
Reputation: 150108
If the source string is in an encoding that supports multi-byte characters and the target encoding is one that does not, it seems reasonable that one could get a collision, since many distinct characters must be squeezed into a single-byte character set.
For example, if the input strings are written in Chinese and the target character set is US-ASCII, Java's getBytes() replaces every unmappable character with '?' (0x3F), so many different Chinese strings will certainly map to the same US-ASCII byte sequence.
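A minimal sketch of that collision, assuming Java's default replacement behavior for unmappable characters (the class name and sample strings are my own):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiCollision {
    public static void main(String[] args) {
        // Two different Chinese strings; US-ASCII cannot represent either,
        // so getBytes() substitutes '?' (0x3F) for each unmappable character.
        byte[] a = "\u4F60\u597D".getBytes(Charset.forName("US-ASCII")); // "hello"
        byte[] b = "\u518D\u89C1".getBytes(Charset.forName("US-ASCII")); // "goodbye"
        System.out.println(Arrays.equals(a, b)); // true: both are {0x3F, 0x3F}
    }
}
```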
Upvotes: 1
Reputation: 6203
Here's an easy one: most codepages and UTF-8 share the ASCII range (0x00–0x7F). If your text is in plain English, there's a big chance that it's pure ASCII, whatever the declared encoding, since it would use mostly plain, non-accented characters.
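A quick sketch of that overlap: the same ASCII-only string yields identical bytes under three different encodings (the class name and sample string are my own):

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class AsciiOverlap {
    public static void main(String[] args) {
        String s = "plain English text";
        byte[] utf8 = s.getBytes(Charset.forName("UTF-8"));
        byte[] cp1252 = s.getBytes(Charset.forName("windows-1252"));
        byte[] latin1 = s.getBytes(Charset.forName("ISO-8859-1"));
        // All three encodings agree on the ASCII range, so the bytes match
        System.out.println(Arrays.equals(utf8, cp1252)
                && Arrays.equals(cp1252, latin1)); // true
    }
}
```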
Upvotes: 1