Streams with TreeMap return incoherent results

Question

I am trying to solve the following exercise from "Core Java for the Impatient" by Cay Horstmann:

When an encoder of a Charset with partial Unicode coverage can’t encode a character, it replaces it with a default—usually, but not always, the encoding of "?". Find all replacements of all available character sets that support encoding. Use the newEncoder method to get an encoder, and call its replacement method to get the replacement. For each unique result, report the canonical names of the charsets that use it.

For the sake of education, I have decided to tackle the exercise with gargantuan one-liner using the streaming API, even though - in my opinion - a cleaner solution would divide the calculations into a number of steps, with intermediate variables in-between (certainly it would ease the debugging). Without further ado, here is a monster of code I have created:

   Charset.availableCharsets().values().stream().filter(charset -> charset.canEncode()).collect(
            Collectors.groupingBy(
                    charset -> charset.newEncoder().replacement(),
                    () -> new TreeMap<>((arr1, arr2) -> Arrays.equals(arr1, arr2) == true ? 0 : Integer.compare(arr1.hashCode(), arr2.hashCode())),
                    Collectors.mapping( charset -> charset.name(), Collectors.toList()))).
            values().stream().map(list -> list.stream().collect(Collectors.joining(", "))).forEach(System.out::println);

Basically, we take into account only the charsets that canEncode; create a Map with replacement as key and a list of canonical names as values; because grouping didn't work for arrays with default implementation of groupingBy, which uses HashMap, I have decided to use TreeMap. We then work with the Lists of canonical names, join them with comma and print.

Unfortunately, I have found it to give incoherent results. If I run the function twice in the same program, the first instance returns results consisting of 23 Strings, the second one - just 21 Strings. I suspect it has something to do with a poor implementation of Comparator for TreeMap, which was defined as follows:

((arr1, arr2) -> Arrays.equals(arr1, arr2) == true ? 0 : Integer.compare(arr1.hashCode(), arr2.hashCode()))

If that is the cause, what should be a proper Comparator in this case? Apart from that, can the one-liner be improved in any way?

I am also curious if such convoluted constructs as the code I have written are encountered in professional programs? Maybe it's only me who find it unreadable?

Holger · Accepted Answer

There is no guarantee that the hash code of two distinct instances will be different. That would be an ideal situation, but is never guaranteed. Only the opposite is true: if two objects are equal, they have the same hash code.

So if you create a comparator that considers the objects to be the same when they have the same hash code, arbitrary objects might be considered to be the same. Since the byte[] arrays returned by replacement() are defensive copies, read temporary objects, the result may vary in every run of this code.

Further, since the hash code of an array has nothing to do with its content, your comparator violates the transitivity rule: two arrays with equal content are supposed to be the same, but since they might/very likely have different hash codes, they have a different relation when being compared with a third array, not having the same content, a == b, but a < c and b > c. This is the reason why even equal arrays, which you compare by Arrays.equals can end up in different groups, as the TreeSet failed to find the existing key when comparing with other keys then.

If you want the arrays to be compared by value, you can use:

Charset.availableCharsets().values().stream().filter(Charset::canEncode).collect(
    Collectors.groupingBy(
            charset -> charset.newEncoder().replacement(),
            () -> new TreeMap<>(Comparator.comparing(ByteBuffer::wrap)),
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);

ByteBuffers are Comparable and consistently evaluate the contents of the wrapped array.

I moved the Collectors.joining collector into the grouping collector to avoid the creation of the temporary List whose contents you are going to join afterwards anyway.

By the way, never use code like expression == true. There is no reason to append == true as expression is already sufficient.

Since you are only interested in the values, in other words, don’t need the keys to be of a certain type, you may wrap all arrays beforehand, simplifying the operation and even make it slightly more efficient:

Charset.availableCharsets().values().stream().filter(Charset::canEncode).collect(
    Collectors.groupingBy(
            charset -> ByteBuffer.wrap(charset.newEncoder().replacement()),
            TreeMap::new,
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);

This change even allows resorting to hashing, if no consistent iteration order is required:

Charset.availableCharsets().values().stream().filter(Charset::canEncode).collect(
    Collectors.groupingBy(
            charset -> ByteBuffer.wrap(charset.newEncoder().replacement()),
            Collectors.mapping(Charset::name, Collectors.joining(", "))))
    .values().forEach(System.out::println);

This works, because ByteBuffer also implements equals and hashCode.

Streams with TreeMap return incoherent results

Answers (1)

Related Questions