icza
icza

Reputation: 417642

If a String is in the list (given at compile-time): Is HashSet the fastest solution?

Given a fixed list of strings at compile-time e.g.:

"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"

Utilizing HashSet we have a very fast way (O(1)) to tell if a String provided at runtime is in the list of strings.

For example:

Set<String> SET = new HashSet<>(Arrays.asList( "zero", "one", "two", "three",
        "four", "five", "six", "seven", "eight", "nine"));

boolean listed = SET.contains("some-text");

Are there any other faster ways to tell if a String provided at runtime is in the initial list of strings (given at compile-time) or is HashSet the fastest solution?

Let's Formalize the Question

Given the interface below:

interface Checker {
    String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six",
            "seven", "eight", "nine" };

    boolean contains(String s);
}

Provide the fastest possible implementation given that the values listed in Checker.VALUES will not be changed (so for example you can use those literals in your code if you want to).

Demonstration: HashSetChecker Implementation

The implementation which uses a HashSet would look like this:

class HashSetChecker implements Checker {
    private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));
    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Code to Test an Implementation

When testing, we want to test the Checker.contains() method with the initial interned strings, also with other interned strings (String literals) which will not be found, and also Strings that have the same values (are equals) but are not interned Strings. For this purpose the following CheckerTester.TESTS array will be used.

public class CheckerTester {
    private static final String[] TESTS = { "zero", "one", "two", "three",
            "four", "five", "six", "seven", "eight", "nine", "ten", "eleven",
            "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen",

            new String("zero"), new String("one"), new String("two"),
            new String("three"), new String("four"), new String("five"),
            new String("six"), new String("seven"), new String("eight"),
            new String("nine"), new String("ten"), new String("eleven"),
            new String("twelve"), new String("thirteen"),
            new String("fourteen"), new String("fifteen"),
            new String("sixteen"), new String("seventeen"),
            new String("eighteen"), new String("nineteen") };

    public static void test(Checker checker) {
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++)
            for (String test : TESTS)
                checker.contains(test);

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(),
                (end - start) / 1_000_000);
    }
}

Upvotes: 4

Views: 581

Answers (3)

icza
icza

Reputation: 417642

Let's see some implementations:

First idea: HashSet with big capacity

Some might say that multiple Strings might end up in the same HashSet bucket, so let's use a big initial capacity:

class HashSetChecker2 implements Checker {
    private final Set<String> set = new HashSet<>(1000);
    { set.addAll(Arrays.asList(VALUES)); }

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Using HashMap instead of HashSet

HashSet uses HashMap in its implementation, let's just do the same: get rid of the "shell" HashSet:

class HashMapChecker implements Checker {
    private final Map<String, Object> map = new HashMap<>(1000);
    {
        for (String s : VALUES)
            map.put(s, s);
    }
    @Override
    public boolean contains(String s) {
        return map.containsKey(s);
    }
}

TreeSet

Some might say give TreeSet a try too (it's sorted so it might have a chance). I know it's O(log(n)), but n is small (10 in this case):

class TreeSetChecker implements Checker {
    private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));
    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

Short-circuit OR checks (if-else logic):

class OrChecker implements Checker {
    @Override
    public boolean contains(String s) {
        return "zero".equals(s) || "one".equals(s) || "two".equals(s)
                || "three".equals(s) || "four".equals(s) || "five".equals(s)
                || "six".equals(s) || "seven".equals(s) || "eight".equals(s)
                || "nine".equals(s);
    }
}

Reference equality then revert to short-circuit OR checks

Some might say that we should first check if by reference we have the String and if not then revert to short-circuit OR checks:

class RefOrChecker extends OrChecker {
    @Override
    public boolean contains(String s) {
        return "zero" == s || "one" == s || "two" == s || "three" == s
                || "four" == s || "five" == s || "six" == s || "seven" == s
                || "eight" == s || "nine" == s || super.contains(s);
    }
}

Using switch: fastest I've found so far

Since we have a fixed list of Strings known at compile-time, we can take advantage of the possibility that we can use Strings in switch statements.

We can add a case for each String from the fixed list and return true, and add a default case to return false.

class SwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s) {
            case "zero":
            case "one":
            case "two":
            case "three":
            case "four":
            case "five":
            case "six":
            case "seven":
            case "eight":
            case "nine":
                return true;
            default:
                return false;
        }
    }
}

New finding: Embedded switches (2 switch blocks)

Maaartinus's answer about perfect hashing made me thinking. Even if we have a perfect hash, it still has to run on the whole content of the String provided at runtime which we want to check. So instead we should use something that is available right in the String: its length. Based on the length of the String we use a switch, and inside that switch we use an internal switch only listing the strings with the specified length. With this, we reduce the number of case statements inside a switch:

class EmbeddedSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s.length()) {
        case 3:
            switch (s) {
                case "one":
                case "two":
                case "six":
                    return true;
                default:
                    return false;
            }
        case 4:
            switch (s) {
                case "zero":
                case "four":
                case "five":
                case "nine":
                    return true;
                default:
                    return false;
            }
        case 5:
            switch (s) {
                case "three":
                case "seven":
                case "eight":
                    return true;
                default:
                    return false;
            }
        default:
            return false;
        }
    }
}

New finding: CharSwitchChecker: the Champion

This is basically the combination of an improved EmbeddedSwitchChecker and OldCurmudgeon's idea about a state machine: here we use the switch on the first character of the String (but first we check its length), and based on that we either narrowed down to one possible String or if not, we check the 2nd character too in which case the possible String can be only one (and we can decide by calling String.equals()):

class CharSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        final int length = s.length();
        if (length < 3 || length > 5)
            return false;

        switch (s.charAt(0)) {
        case 'z':
            return "zero".equals(s);
        case 'o':
            return "one".equals(s);
        case 't':
            return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
        case 'f':
            return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
        case 's':
            return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
        case 'e':
            return "eight".equals(s);
        case 'n':
            return "nine".equals(s);
        }
        return false;
    }
}

Test Results

Here are the test results:

                         TIME        HOW FAST (compared to HashSetChecker)
-----------------------------------------------------------------------------
HashSetChecker:          929 ms       1.00x
HashSetChecker2:         892 ms       1.04x
HashMapChecker:          873 ms       1.06x
TreeSetChecker:         2265 ms       0.41x
OrChecker:              1815 ms       0.51x
RefOrChecker:           1708 ms       0.54x
SwitchChecker:           538 ms       1.73x
EmbeddedSwitchChecker:   467 ms       1.99x
CharSwitchChecker:       436 ms       2.13x

The SwitchChecker solution is about 1.7 times faster, the EmbeddedSwitchChecker is 2 times faster and the champion CharSwitchChecker is about 2.13 times faster than the HashSetChecker implementation. As expected the HashSet with big initial capacity and the HashMap solutions are slightly faster, and all other solutions fall behind.

Complete Runnable Test Program (+All Listed Solutions)

The Complete Runnalbe Test Program plus All listed solutions are here in one box for those who want to try it or experiment with new implementations.

Edit: Following Luiggi Mendoza's suggestion for rules for performing a micro benchmark, I changed the main() method for testing. I execute the whole test twice, and I only analyze the 2nd result. Also since the tests don't create new objects in the loop, I see no reason to call System.gc() whatsoever.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

interface Checker {
    String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

    boolean contains(String s);
}

class HashSetChecker implements Checker {
    private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class HashSetChecker2 implements Checker {
    private final Set<String> set = new HashSet<>(1000);
    {
        set.addAll(Arrays.asList(VALUES));
    }

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class HashMapChecker implements Checker {
    private final Map<String, Object> map = new HashMap<>(1000);
    {
        for (String s : VALUES)
            map.put(s, s);
    }

    @Override
    public boolean contains(String s) {
        return map.containsKey(s);
    }
}

class TreeSetChecker implements Checker {
    private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));

    @Override
    public boolean contains(String s) {
        return set.contains(s);
    }
}

class OrChecker implements Checker {
    @Override
    public boolean contains(String s) {
        return "zero".equals(s) || "one".equals(s) || "two".equals(s) || "three".equals(s)
                || "four".equals(s) || "five".equals(s) || "six".equals(s) || "seven".equals(s)
                || "eight".equals(s) || "nine".equals(s);
    }
}

class RefOrChecker extends OrChecker {
    @Override
    public boolean contains(String s) {
        return "zero" == s || "one" == s || "two" == s || "three" == s || "four" == s || "five" == s
                || "six" == s || "seven" == s || "eight" == s || "nine" == s || super.contains(s);
    }
}

class SwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s) {
        case "zero":
        case "one":
        case "two":
        case "three":
        case "four":
        case "five":
        case "six":
        case "seven":
        case "eight":
        case "nine":
            return true;
        default:
            return false;
        }
    }
}

class EmbeddedSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        switch (s.length()) {
        case 3:
            switch (s) {
            case "one":
            case "two":
            case "six":
                return true;
            default:
                return false;
            }
        case 4:
            switch (s) {
            case "zero":
            case "four":
            case "five":
            case "nine":
                return true;
            default:
                return false;
            }
        case 5:
            switch (s) {
            case "three":
            case "seven":
            case "eight":
                return true;
            default:
                return false;
            }
        default:
            return false;
        }
    }
}

class CharSwitchChecker implements Checker {
    @Override
    public boolean contains(String s) {
        final int length = s.length();
        if (length < 3 || length > 5)
            return false;

        switch (s.charAt(0)) {
        case 'z':
            return "zero".equals(s);
        case 'o':
            return "one".equals(s);
        case 't':
            return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
        case 'f':
            return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
        case 's':
            return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
        case 'e':
            return "eight".equals(s);
        case 'n':
            return "nine".equals(s);
        }
        return false;
    }
}

public class CheckerTester {
    private static final String[] TESTS = { "zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
            "seventeen", "eighteen", "nineteen",

            new String("zero"), new String("one"), new String("two"), new String("three"),
            new String("four"), new String("five"), new String("six"), new String("seven"),
            new String("eight"), new String("nine"), new String("ten"), new String("eleven"),
            new String("twelve"), new String("thirteen"), new String("fourteen"), new String("fifteen"),
            new String("sixteen"), new String("seventeen"), new String("eighteen"), new String("nineteen") };

    public static void test(Checker checker) {
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++)
            for (String test : TESTS)
                checker.contains(test);

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(), (end - start) / 1_000_000);
    }

    public static void main(String args[]) {
        for (int i = 1; i <= 2; i++) {
            System.out.println("---- Check #" + i);
            test(new HashSetChecker());
            test(new HashSetChecker2());
            test(new HashMapChecker());
            test(new TreeSetChecker());
            test(new OrChecker());
            test(new RefOrChecker());
            test(new SwitchChecker());
            test(new EmbeddedSwitchChecker());
            test(new CharSwitchChecker());
        }
    }

}

Upvotes: 4

OldCurmudgeon
OldCurmudgeon

Reputation: 65811

One improvement on @icza's excellent suggestions is a state machine. Here's an implementation that looks like it may be quite efficient. Perhaps @icza will incorporate it into his time tests and we'll see how it fares.

Essentially it builds a static tree structure that can be traversed taking one step per character of the test string. If at any time the required step is not in the tree we can signal a mis-match. If we get to the end of the string we then check if this ending is legal.

This should be a O(k) algorithm (if there was such a thing) as it's run-time is linear to the length of the input string but there is clearly some set-up time.

public class Test {

    interface Checker {

        Set<String> VALUES = new HashSet<>(Arrays.asList("zero", "one", "two", "three", "four", "five", "six",
                "seven", "eight", "nine"));

        boolean contains(String s);
    }

    public static class HatChecker implements Checker {

        // Can't think of a name.
        static class Hats {

            // All possible children.
            Hats[] hats = new Hats[256];
            // Are we at the end of a word.
            boolean end = false;
        }

        // Root hats - contains one entry fr each possible fisrt characetr.
        static Hats root = new Hats();

        /**
         * Where should it go?
         */
        private static Hats find(String s, boolean grow) {
            Hats hats = root;
            for (int i = 0; i < s.length(); i++) {
                int ch = s.charAt(i);
                Hats newHats = hats.hats[ch];
                // Not seen this sequence yet?
                if (newHats == null) {
                    if (grow) {
                        // Allowed to grow.
                        newHats = hats.hats[ch] = new Hats();
                    } else {
                        // No growing - stop here.
                        return null;
                    }
                }
                hats = newHats;
            }
            return hats;
        }

        /**
         * Add to the structures.
         */
        private static void add(String s) {
            // Grow it and margk it good.
            find(s, true).end = true;
        }

        static {
            // Grow my structure.
            for (String s : VALUES) {
                add(s);
            }
        }

        @Override
        public boolean contains(String s) {
            // Find where it should be but don't grow.
            Hats found = find(s, false);
            // It's a match if it wa sthere and was an end.
            return found != null && found.end;
        }

    }

    private static class Check {

        private final String s;
        private final boolean matches;

        public Check(String s) {
            this.s = s;
            this.matches = Checker.VALUES.contains(s);
        }

        public String toString() {
            return "(" + s + ")=" + matches;
        }
    }
    private static final Check[] TESTS = {
        new Check("zero"),
        new Check("one"),
        new Check("two"),
        new Check("three"),
        new Check("four"),
        new Check("five"),
        new Check("six"),
        new Check("seven"),
        new Check("eight"),
        new Check("nine"),
        new Check("ten"),
        new Check("eleven"),
        new Check("twelve"),
        new Check("thirteen"),
        new Check("fourteen"),
        new Check("fifteen"),
        new Check("sixteen"),
        new Check("seventeen"),
        new Check("eighteen"),
        new Check("nineteen"),
        new Check(new String("zero")),
        new Check(new String("one")),
        new Check(new String("two")),
        new Check(new String("three")),
        new Check(new String("four")),
        new Check(new String("five")),
        new Check(new String("six")),
        new Check(new String("seven")),
        new Check(new String("eight")),
        new Check(new String("nine")),
        new Check(new String("ten")),
        new Check(new String("eleven")),
        new Check(new String("twelve")),
        new Check(new String("thirteen")),
        new Check(new String("fourteen")),
        new Check(new String("fifteen")),
        new Check(new String("sixteen")),
        new Check(new String("seventeen")),
        new Check(new String("eighteen")),
        new Check(new String("nineteen"))};

    public void timeTest(Checker checker) {
        System.out.println("Time");
        final int N = 1_000_000;

        long start = System.nanoTime();

        for (int i = 0; i < N; i++) {
            for (Check check : TESTS) {
                checker.contains(check.s);
            }
        }

        long end = System.nanoTime();

        System.out.printf("%s: %d ms\n", checker.getClass().getName(),
                (end - start) / 1_000_000);
    }

    public void checkerTest(Checker checker) {
        System.out.println("Checker");
        for (Check check : TESTS) {
            if (checker.contains(check.s) != check.matches) {
                System.err.println("Check(" + check + ") failed");
            }
        }
    }

    public static void main(String args[]) {
        try {
            Checker checker = new HatChecker();
            Test test = new Test();
            test.checkerTest(checker);
            test.timeTest(checker);
        } catch (Throwable ex) {
            ex.printStackTrace(System.err);
        }
    }
}

I suspect this will be on par with the case statement - It should be interesting though.

Sorry about the Hat naming - I just couldn't thing of a good name.

Upvotes: 1

maaartinus
maaartinus

Reputation: 46422

The fastest solution could be based on perfect hashing. Find a fast hash function mapping all your strings onto distinct integers in a small range and build a hash table. The hash table is fast as there are by definition no collisions. Finding the perfect hash function may take long and finding a fast one may be even harder.

One example where it works pretty well is Guava's CharMatcher.WHITESPACE, where all whitespace characters get mapped onto the set {0, ..., 31} by using a multiplication and a shift (see here for some explanation). The previous implementation used perfect hashing, too, but was way slower because of division.

Finding a fast perfect hash for your 10 strings and a table of size 16 should be rather simple.

Upvotes: 2

Related Questions