Reputation: 417642
Given a fixed list of strings at compile-time e.g.:
"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
Utilizing HashSet
we have a very fast way (O(1)) to tell if a String
provided at runtime is in the list of strings.
For example:
Set<String> SET = new HashSet<>(Arrays.asList( "zero", "one", "two", "three",
"four", "five", "six", "seven", "eight", "nine"));
boolean listed = SET.contains("some-text");
Are there any other faster ways to tell if a String
provided at runtime is in the initial list of strings (given at compile-time) or is HashSet
the fastest solution?
Given the interface below:
interface Checker {
String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six",
"seven", "eight", "nine" };
boolean contains(String s);
}
Provide the fastest possible implementation given that the values listed in Checker.VALUES
will not be changed (so for example you can use those literals in your code if you want to).
HashSetChecker
ImplementationThe implementation which uses a HashSet
would look like this:
class HashSetChecker implements Checker {
private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));
@Override
public boolean contains(String s) {
return set.contains(s);
}
}
When testing, we want to test the Checker.contains()
method with the initial interned strings, also with other interned strings (String
literals) which will not be found, and also Strings that have the same values (are equals) but are not interned Strings. For this purpose the following CheckerTester.TESTS
array will be used.
public class CheckerTester {
private static final String[] TESTS = { "zero", "one", "two", "three",
"four", "five", "six", "seven", "eight", "nine", "ten", "eleven",
"twelve", "thirteen", "fourteen", "fifteen", "sixteen",
"seventeen", "eighteen", "nineteen",
new String("zero"), new String("one"), new String("two"),
new String("three"), new String("four"), new String("five"),
new String("six"), new String("seven"), new String("eight"),
new String("nine"), new String("ten"), new String("eleven"),
new String("twelve"), new String("thirteen"),
new String("fourteen"), new String("fifteen"),
new String("sixteen"), new String("seventeen"),
new String("eighteen"), new String("nineteen") };
public static void test(Checker checker) {
final int N = 1_000_000;
long start = System.nanoTime();
for (int i = 0; i < N; i++)
for (String test : TESTS)
checker.contains(test);
long end = System.nanoTime();
System.out.printf("%s: %d ms\n", checker.getClass().getName(),
(end - start) / 1_000_000);
}
}
Upvotes: 4
Views: 581
Reputation: 417642
Let's see some implementations:
HashSet
with big capacitySome might say that multiple Strings might end up in the same HashSet
bucket, so let's use a big initial capacity:
class HashSetChecker2 implements Checker {
private final Set<String> set = new HashSet<>(1000);
{ set.addAll(Arrays.asList(VALUES)); }
@Override
public boolean contains(String s) {
return set.contains(s);
}
}
HashMap
instead of HashSet
HashSet
uses HashMap
in its implementation, let's just do the same: get rid of the "shell" HashSet
:
class HashMapChecker implements Checker {
private final Map<String, Object> map = new HashMap<>(1000);
{
for (String s : VALUES)
map.put(s, s);
}
@Override
public boolean contains(String s) {
return map.containsKey(s);
}
}
TreeSet
Some might say give TreeSet
a try too (it's sorted so it might have a chance). I know it's O(log(n)), but n
is small (10 in this case):
class TreeSetChecker implements Checker {
private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));
@Override
public boolean contains(String s) {
return set.contains(s);
}
}
class OrChecker implements Checker {
@Override
public boolean contains(String s) {
return "zero".equals(s) || "one".equals(s) || "two".equals(s)
|| "three".equals(s) || "four".equals(s) || "five".equals(s)
|| "six".equals(s) || "seven".equals(s) || "eight".equals(s)
|| "nine".equals(s);
}
}
Some might say that we should first check if by reference we have the String
and if not then revert to short-circuit OR checks:
class RefOrChecker extends OrChecker {
@Override
public boolean contains(String s) {
return "zero" == s || "one" == s || "two" == s || "three" == s
|| "four" == s || "five" == s || "six" == s || "seven" == s
|| "eight" == s || "nine" == s || super.contains(s);
}
}
switch
: fastest I've found so farSince we have a fixed list of String
s known at compile-time, we can take advantage of the possibility that we can use String
s in switch
statements.
We can add a case
for each String
from the fixed list and return true
, and add a default
case to return false
.
class SwitchChecker implements Checker {
@Override
public boolean contains(String s) {
switch (s) {
case "zero":
case "one":
case "two":
case "three":
case "four":
case "five":
case "six":
case "seven":
case "eight":
case "nine":
return true;
default:
return false;
}
}
}
switch
blocks)Maaartinus's answer about perfect hashing made me thinking. Even if we have a perfect hash, it still has to run on the whole content of the String
provided at runtime which we want to check. So instead we should use something that is available right in the String
: its length. Based on the length of the String
we use a switch
, and inside that switch
we use an internal switch
only listing the strings with the specified length. With this, we reduce the number of case
statements inside a switch
:
class EmbeddedSwitchChecker implements Checker {
@Override
public boolean contains(String s) {
switch (s.length()) {
case 3:
switch (s) {
case "one":
case "two":
case "six":
return true;
default:
return false;
}
case 4:
switch (s) {
case "zero":
case "four":
case "five":
case "nine":
return true;
default:
return false;
}
case 5:
switch (s) {
case "three":
case "seven":
case "eight":
return true;
default:
return false;
}
default:
return false;
}
}
}
This is basically the combination of an improved EmbeddedSwitchChecker
and OldCurmudgeon's idea about a state machine: here we use the switch
on the first character of the String
(but first we check its length), and based on that we either narrowed down to one possible String
or if not, we check the 2nd character too in which case the possible String
can be only one (and we can decide by calling String.equals()
):
class CharSwitchChecker implements Checker {
@Override
public boolean contains(String s) {
final int length = s.length();
if (length < 3 || length > 5)
return false;
switch (s.charAt(0)) {
case 'z':
return "zero".equals(s);
case 'o':
return "one".equals(s);
case 't':
return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
case 'f':
return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
case 's':
return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
case 'e':
return "eight".equals(s);
case 'n':
return "nine".equals(s);
}
return false;
}
}
Here are the test results:
TIME HOW FAST (compared to HashSetChecker)
-----------------------------------------------------------------------------
HashSetChecker: 929 ms 1.00x
HashSetChecker2: 892 ms 1.04x
HashMapChecker: 873 ms 1.06x
TreeSetChecker: 2265 ms 0.41x
OrChecker: 1815 ms 0.51x
RefOrChecker: 1708 ms 0.54x
SwitchChecker: 538 ms 1.73x
EmbeddedSwitchChecker: 467 ms 1.99x
CharSwitchChecker: 436 ms 2.13x
The SwitchChecker
solution is about 1.7 times faster, the EmbeddedSwitchChecker
is 2 times faster and the champion CharSwitchChecker
is about 2.13 times faster than the HashSetChecker
implementation. As expected the HashSet
with big initial capacity and the HashMap
solutions are slightly faster, and all other solutions fall behind.
The Complete Runnalbe Test Program plus All listed solutions are here in one box for those who want to try it or experiment with new implementations.
Edit: Following Luiggi Mendoza's suggestion for rules for performing a micro benchmark, I changed the main()
method for testing. I execute the whole test twice, and I only analyze the 2nd result. Also since the tests don't create new objects in the loop, I see no reason to call System.gc()
whatsoever.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
interface Checker {
String[] VALUES = { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };
boolean contains(String s);
}
class HashSetChecker implements Checker {
private final Set<String> set = new HashSet<>(Arrays.asList(VALUES));
@Override
public boolean contains(String s) {
return set.contains(s);
}
}
class HashSetChecker2 implements Checker {
private final Set<String> set = new HashSet<>(1000);
{
set.addAll(Arrays.asList(VALUES));
}
@Override
public boolean contains(String s) {
return set.contains(s);
}
}
class HashMapChecker implements Checker {
private final Map<String, Object> map = new HashMap<>(1000);
{
for (String s : VALUES)
map.put(s, s);
}
@Override
public boolean contains(String s) {
return map.containsKey(s);
}
}
class TreeSetChecker implements Checker {
private final Set<String> set = new TreeSet<>(Arrays.asList(VALUES));
@Override
public boolean contains(String s) {
return set.contains(s);
}
}
class OrChecker implements Checker {
@Override
public boolean contains(String s) {
return "zero".equals(s) || "one".equals(s) || "two".equals(s) || "three".equals(s)
|| "four".equals(s) || "five".equals(s) || "six".equals(s) || "seven".equals(s)
|| "eight".equals(s) || "nine".equals(s);
}
}
class RefOrChecker extends OrChecker {
@Override
public boolean contains(String s) {
return "zero" == s || "one" == s || "two" == s || "three" == s || "four" == s || "five" == s
|| "six" == s || "seven" == s || "eight" == s || "nine" == s || super.contains(s);
}
}
class SwitchChecker implements Checker {
@Override
public boolean contains(String s) {
switch (s) {
case "zero":
case "one":
case "two":
case "three":
case "four":
case "five":
case "six":
case "seven":
case "eight":
case "nine":
return true;
default:
return false;
}
}
}
class EmbeddedSwitchChecker implements Checker {
@Override
public boolean contains(String s) {
switch (s.length()) {
case 3:
switch (s) {
case "one":
case "two":
case "six":
return true;
default:
return false;
}
case 4:
switch (s) {
case "zero":
case "four":
case "five":
case "nine":
return true;
default:
return false;
}
case 5:
switch (s) {
case "three":
case "seven":
case "eight":
return true;
default:
return false;
}
default:
return false;
}
}
}
class CharSwitchChecker implements Checker {
@Override
public boolean contains(String s) {
final int length = s.length();
if (length < 3 || length > 5)
return false;
switch (s.charAt(0)) {
case 'z':
return "zero".equals(s);
case 'o':
return "one".equals(s);
case 't':
return s.charAt(1) == 'w' ? "two".equals(s) : "three".equals(s);
case 'f':
return s.charAt(1) == 'o' ? "four".equals(s) : "five".equals(s);
case 's':
return s.charAt(1) == 'i' ? "six".equals(s) : "seven".equals(s);
case 'e':
return "eight".equals(s);
case 'n':
return "nine".equals(s);
}
return false;
}
}
public class CheckerTester {
private static final String[] TESTS = { "zero", "one", "two", "three", "four", "five", "six", "seven",
"eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
"seventeen", "eighteen", "nineteen",
new String("zero"), new String("one"), new String("two"), new String("three"),
new String("four"), new String("five"), new String("six"), new String("seven"),
new String("eight"), new String("nine"), new String("ten"), new String("eleven"),
new String("twelve"), new String("thirteen"), new String("fourteen"), new String("fifteen"),
new String("sixteen"), new String("seventeen"), new String("eighteen"), new String("nineteen") };
public static void test(Checker checker) {
final int N = 1_000_000;
long start = System.nanoTime();
for (int i = 0; i < N; i++)
for (String test : TESTS)
checker.contains(test);
long end = System.nanoTime();
System.out.printf("%s: %d ms\n", checker.getClass().getName(), (end - start) / 1_000_000);
}
public static void main(String args[]) {
for (int i = 1; i <= 2; i++) {
System.out.println("---- Check #" + i);
test(new HashSetChecker());
test(new HashSetChecker2());
test(new HashMapChecker());
test(new TreeSetChecker());
test(new OrChecker());
test(new RefOrChecker());
test(new SwitchChecker());
test(new EmbeddedSwitchChecker());
test(new CharSwitchChecker());
}
}
}
Upvotes: 4
Reputation: 65811
One improvement on @icza's excellent suggestions is a state machine. Here's an implementation that looks like it may be quite efficient. Perhaps @icza will incorporate it into his time tests and we'll see how it fares.
Essentially it builds a static tree structure that can be traversed taking one step per character of the test string. If at any time the required step is not in the tree we can signal a mis-match. If we get to the end of the string we then check if this ending is legal.
This should be a O(k)
algorithm (if there was such a thing) as it's run-time is linear to the length of the input string but there is clearly some set-up time.
public class Test {
interface Checker {
Set<String> VALUES = new HashSet<>(Arrays.asList("zero", "one", "two", "three", "four", "five", "six",
"seven", "eight", "nine"));
boolean contains(String s);
}
public static class HatChecker implements Checker {
// Can't think of a name.
static class Hats {
// All possible children.
Hats[] hats = new Hats[256];
// Are we at the end of a word.
boolean end = false;
}
// Root hats - contains one entry fr each possible fisrt characetr.
static Hats root = new Hats();
/**
* Where should it go?
*/
private static Hats find(String s, boolean grow) {
Hats hats = root;
for (int i = 0; i < s.length(); i++) {
int ch = s.charAt(i);
Hats newHats = hats.hats[ch];
// Not seen this sequence yet?
if (newHats == null) {
if (grow) {
// Allowed to grow.
newHats = hats.hats[ch] = new Hats();
} else {
// No growing - stop here.
return null;
}
}
hats = newHats;
}
return hats;
}
/**
* Add to the structures.
*/
private static void add(String s) {
// Grow it and margk it good.
find(s, true).end = true;
}
static {
// Grow my structure.
for (String s : VALUES) {
add(s);
}
}
@Override
public boolean contains(String s) {
// Find where it should be but don't grow.
Hats found = find(s, false);
// It's a match if it wa sthere and was an end.
return found != null && found.end;
}
}
private static class Check {
private final String s;
private final boolean matches;
public Check(String s) {
this.s = s;
this.matches = Checker.VALUES.contains(s);
}
public String toString() {
return "(" + s + ")=" + matches;
}
}
private static final Check[] TESTS = {
new Check("zero"),
new Check("one"),
new Check("two"),
new Check("three"),
new Check("four"),
new Check("five"),
new Check("six"),
new Check("seven"),
new Check("eight"),
new Check("nine"),
new Check("ten"),
new Check("eleven"),
new Check("twelve"),
new Check("thirteen"),
new Check("fourteen"),
new Check("fifteen"),
new Check("sixteen"),
new Check("seventeen"),
new Check("eighteen"),
new Check("nineteen"),
new Check(new String("zero")),
new Check(new String("one")),
new Check(new String("two")),
new Check(new String("three")),
new Check(new String("four")),
new Check(new String("five")),
new Check(new String("six")),
new Check(new String("seven")),
new Check(new String("eight")),
new Check(new String("nine")),
new Check(new String("ten")),
new Check(new String("eleven")),
new Check(new String("twelve")),
new Check(new String("thirteen")),
new Check(new String("fourteen")),
new Check(new String("fifteen")),
new Check(new String("sixteen")),
new Check(new String("seventeen")),
new Check(new String("eighteen")),
new Check(new String("nineteen"))};
public void timeTest(Checker checker) {
System.out.println("Time");
final int N = 1_000_000;
long start = System.nanoTime();
for (int i = 0; i < N; i++) {
for (Check check : TESTS) {
checker.contains(check.s);
}
}
long end = System.nanoTime();
System.out.printf("%s: %d ms\n", checker.getClass().getName(),
(end - start) / 1_000_000);
}
public void checkerTest(Checker checker) {
System.out.println("Checker");
for (Check check : TESTS) {
if (checker.contains(check.s) != check.matches) {
System.err.println("Check(" + check + ") failed");
}
}
}
public static void main(String args[]) {
try {
Checker checker = new HatChecker();
Test test = new Test();
test.checkerTest(checker);
test.timeTest(checker);
} catch (Throwable ex) {
ex.printStackTrace(System.err);
}
}
}
I suspect this will be on par with the case
statement - It should be interesting though.
Sorry about the Hat
naming - I just couldn't thing of a good name.
Upvotes: 1
Reputation: 46422
The fastest solution could be based on perfect hashing. Find a fast hash function mapping all your strings onto distinct integers in a small range and build a hash table. The hash table is fast as there are by definition no collisions. Finding the perfect hash function may take long and finding a fast one may be even harder.
One example where it works pretty well is Guava's CharMatcher.WHITESPACE
, where all whitespace characters get mapped onto the set {0, ..., 31}
by using a multiplication and a shift (see here for some explanation). The previous implementation used perfect hashing, too, but was way slower because of division.
Finding a fast perfect hash for your 10 strings and a table of size 16 should be rather simple.
Upvotes: 2