WSBT
WSBT

Reputation: 36343

Why is HashSet<int> much slower than HashSet<String> in dart?

I noticed a HashSet<int> performing very slowly when working on a Flutter project. I had about 20,000 integers in a Set, and checking set.contains() took a very long time. But when I use toString() to convert all items to string, it performed 1000x faster.

I then tried to create a minimal reproducible code with 10 million random integers, but I couldn't reproduce the issue. Turns out, something special about these data caused the extreme slowness. I've attached a test code (and data) at the end of this question.

How to reproduce:

First, click "add int" button to add 14790 integers to a set. Then click "query int" (runs set.contains(123)) and "query string" (runs set.contains('123')). Observe that: 1. both operations are super slow; 2. the int version is slower than the string version. Picture:

int test

Then click "clear items", then "add string" to add the toString() version of the data. Then click "query int" and "query string" again, notice how much faster it becomes. Picture:

string test

Lastly, click both "add int" and "add string" to create a mixed set (with twice the entries). Observe that the querying times dropped in half for the int version, as if the faster strings helped "dilute" the problem. Picture:

int string mix test

I've had several friends running the same test code on various machines (intel i5, apple M1, snapdragon), timings are different but the conclusions are the same.

What's not the answer here:

Here are some things I considered, but they couldn't explain what's happening with some more tests.

  1. Maybe int needs boxing, whereas string is already an object?

That's probably not the issue here. With 1 million randomly generated values, ints performed faster than strings.

  1. string is immutable so their hash value could be cached?

I don't know if they are cached, but this doesn't explain the results observed with 1 million randomly generated values.

  1. int hash resulted in a lot of collisions?

I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.

Test code:

The full test code with data is too long for StackOverflow, I've put it here https://pastebin.com/raw/4fm2hKQB instead.

So yeah, I'm lost, if anyone could help me understand what's going on that'll be greatly appreciated!

Upvotes: 7

Views: 1153

Answers (2)

Stephen
Stephen

Reputation: 471

I commented on the issue in the Dart repo. For completeness I will mention the 'answer' part of the comment here.

The implementations of HashSet and LinkedHashSet make the assumption that the key.hashCode values are 'good' hash codes that are reasonably distributed over a range of integers so that the lower N bits do not collide or nearly collide to 'bunch up' in the hash-table. Unfortunately int.hashCode does not have this property as it is effectively the identity function.

Things go wrong when the lower bits of all the keys are the same (or have only a few of the possible values) so taking the lower N bits gives the same effective hash code value. This is just the power-of-two version of the % 1000 example mentioned by @ch271828n.

@ch271828n mentions using a different hashCode. This is probably the best short-term solution. Use LinkedHashSet(hashCode: dispersedHashCode) with something like this:

int dispersedHashCode(e) { // untested!
  int hash = e.hashCode;
  // Odd number with 30%-50% of the bits set in an irregular pattern.
  hash *= 0x1736B4D29;
  hash += hash >>> 20;
  // maybe do it again to let bits higher that 20 influence the low bits.
  return hash;
}

Something like this would ideally be built into the core library hashed structures. This might take a long time since, realistically, a performance issue with a simple work-around will be likely be prioritized behind security bugs, incorrect behaviour bugs, performance issues with no work-around, and new features that enable customers to do things that are otherwise impossible to difficult to do.

A completely different approach would be to use an ordered Set like SplayTreeSet.

Upvotes: 5

ch271828n
ch271828n

Reputation: 17617

I am also considering hash collision problem.

int hash resulted in a lot of collisions? I tried to print out .hashCode for all ints and strings in the data set, and verified they are all unique.

Well, "all unique" does not mean "there is no collision". For a hash set, the number of bins are much less than the number of hashcode. For example, suppose you have a hash set with 1000 bins, and the mapping from hashCode to bin index is a simple bin index := hashCode % 1000, and suppose your data has hashCode like 0, 1000, 2000, 3000 etc. In this artificial case, your data has all unique hashCode, but they all fall into the first bin out of the 1000 bins. Huge collision!

A simple approach to debug whether it is the problem of hashcode: Re-run the program with LinkedHashSet(hashCode: (e) => some_other_hash_approach(e), equals: ...). By using such a new hash set, you can test on other hashCode generating functions. If some hashCode generating functions do not result in the same extremely slow speed, it is highly because of the original hashCode function which causes collision.

In addition, you can even use the same hashCode method for both the int and the String case. Then you guarantee that both cases have exactly the same collision behavior. Then it is easy to see whether collision is the cause, or is unrelated.

Another debug approach: Look at the C++ source code of LinkedHashSet, and see what algorithms it uses to assign data to bins. Then check whether collision as mentioned above happens or not.

A third debug method: Compile the pure-Dart program into an executable, and use profilers like perf to run it. Then you can see which code is hottest and consume most of the time. You may need debug symbols of Dart's native C++ code, which should be fetchable.

Upvotes: 0

Related Questions