Kavish Munjal

Reputation: 111

C++ fastest data structure for multiple searches

Coding in C++. I need a data structure for a bunch of sorted strings. I will be inserting all the strings into it in one go and not updating it, but I will be searching for strings very often. All I need is to see whether a given string exists in the structure or not. I am expecting the list to be about 100 strings. What would be the fastest structure? I was thinking of a hashmap at first, but I saw somewhere that for such a small number of elements, a binary search over a vector would work better (since they're sorted).
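Roughly, the binary-search-over-a-vector option I have in mind looks like this (just a sketch, names are illustrative):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sorted vector + binary search: the alternative I'm comparing against a hashmap.
bool contains(const std::vector<std::string>& sorted, const std::string& s) {
    return std::binary_search(sorted.begin(), sorted.end(), s);
}
```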

Upvotes: 11

Views: 7496

Answers (9)

minika woon

Reputation: 68

You may try a binary index array: a C library that indexes struct member fields.

The tutorial blog is here https://minikawoon.quora.com/How-to-search-data-faster-on-big-amount-of-data-in-C-C++

Sample:

Step 1. Define your struct

typedef struct {
  char book_name[30];
  char book_description[61];
  char book_categories[9];
  int book_code;  
} my_book_t;

// 160000 size, 10 index field slot
bin_array_t *all_books = bin_array_create(160000, 10);

Step 2. Add index

if (bin_add_index(all_books, my_book_t, book_name, __def_cstr_sorted_cmp_func__)
&& bin_add_index(all_books, my_book_t, book_categories, __def_cstr_sorted_cmp_func__)
&& bin_add_index(all_books, my_book_t, book_code, __def_int_sorted_cmp_func__)
   ) {

Step 3. Initialize your data

    my_book_t *bk = malloc(sizeof(my_book_t));
    strcpy(bk->book_name, "The Duck Story");
    ....
    ...
    bin_array_push(all_books, bk);

Step 4. Search — eq (equal), lt (less than), gt (greater than)

int data_search = 100;
bin_array_rs *bk_rs = ba_search_eq(all_books, my_book_t,
book_code, &data_search);
my_book_t **bks = (my_book_t**)bk_rs->ptrs; // Convert to pointer array
// Loop over the results
for (i = 0; i < bk_rs->size; i++) {
   my_book_t *bk = bks[i];
    ....
}

Step 5. Multiple search and inner join or union

 // Join Solution
bin_array_rs *bk_rs=bin_intersect_rs(
    bin_intersect_rs(ba_search_gt(...), ba_search_lt(...), true),
    bin_intersect_rs(ba_search_gt(...), ba_search_lt(....), true),
                             true);

 // Union Solution
bin_array_rs *bk_rs= bin_union_rs(
    bin_union_rs(ba_search_gt(...), ba_search_lt(...), true),
    bin_union_rs(ba_search_gt(...), ba_search_lt(....), true),
                             true);

Read the documentation for more details about how it searches and how to release memory after a search.

Upvotes: 0

unknown

Reputation: 376

It depends on how different your strings are or what particular shape they have.

I think a hashmap is a good idea, if you're willing to take on the memory overhead. For only about 100 strings, the first character is enough:

String* myStrings[256];

You simply look at the first char of your string to determine in which array it could be.

If your strings are heterogeneous enough (i.e. they don't usually start with the same letter), the gain is theoretically 256x speed. The cost is an additional 256 pointers (256 × 64 bits = 2 KiB) in memory. You can compensate a bit for that cost by removing the first character from the actual stored strings.
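A minimal sketch of this first-character bucketing in C++ (names are illustrative, and std::string stands in for String):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// One bucket per possible first byte; each bucket is kept sorted so a
// binary search finishes the lookup.
struct FirstCharIndex {
    std::vector<std::string> buckets[256];

    void insert(const std::string& s) {
        if (s.empty()) return;
        auto& b = buckets[(unsigned char)s[0]];
        b.insert(std::lower_bound(b.begin(), b.end(), s), s);
    }

    bool contains(const std::string& s) const {
        if (s.empty()) return false;
        const auto& b = buckets[(unsigned char)s[0]];
        return std::binary_search(b.begin(), b.end(), s);
    }
};
```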

If you decide to scale up to 2 characters or more, both the advantages and the drawbacks grow exponentially.

String* myStrings[256][256][256];

However, if your strings are special and cannot, for example, start with or contain arbitrary chars, then you can shrink the array and map the used chars to slots.

char charToSlot[256]; 
String* myStrings[3];

For example in this case if your strings can only start with the characters 100, 235 and 201, then charToSlot[100] = 0, charToSlot[235] = 1 and charToSlot[201] = 2.

Looking up the index is slightly slower but the memory impact is minimal. That could help you if the strings you manipulate can only contain the alphabet in lowercase. Then your ideal structure for one character would be:

char charToSlot[256]; 
String* myStrings[26]; 

And it can be scaled up more easily:

char charToSlot[256]; 
String* myStrings[26][26][26]; 

If you don't want to make any assumptions about your strings (i.e. they can contain anything), then you could implement some dynamic indexing (indexes are added as soon as they are needed, and the array needs to be reallocated constantly).

char charToSlot[256]; 
String**** myStrings; 

Another trick: if your strings vary in length and are pretty small (5-30 characters), you could add an additional index that would again multiply the speed, by only searching among the strings with the same length.

String* myStrings[30][256][256]...

If you think those solutions are too heavy, then you can take a more statistical approach. You could give the same branch to several characters. For example 'a', 'b', 'c' and 'd' would all go down the same way, and you'd have 4 times less branches. Then you'd arrive to the list and check again, char by char, if a string is equal, with an increased chance of obtaining what you want.

For example, if your strings can contain all 256 characters, but you want only 8 branches rather than 256, you'd have:

String* myStrings[8]; 

And for any character you'd simply divide it by 32 (very fast) to pick the branch. This is probably the solution I'd recommend for your problem, since you have only about 100 strings and you probably don't want a huge array.

Also this one scales up more nicely:

String* myStrings[8][8][8][8]...

But then the arrays stored could have 32 times more strings and the content isn't deterministic.
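The quantized 8-branch variant could be sketched like so (again illustrative: the first byte divided by 32 picks the branch, and a binary search inside the branch does the final char-by-char check):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// 8 branches: first byte / 32 (a cheap division by a power of two) selects
// the branch; each branch is kept sorted for the final exact check.
struct EightWay {
    std::vector<std::string> branch[8];

    void insert(const std::string& s) {
        if (s.empty()) return;
        auto& b = branch[(unsigned char)s[0] / 32];
        b.insert(std::lower_bound(b.begin(), b.end(), s), s);
    }

    bool contains(const std::string& s) const {
        if (s.empty()) return false;
        const auto& b = branch[(unsigned char)s[0] / 32];
        return std::binary_search(b.begin(), b.end(), s);
    }
};
```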

Again, it all depends on the particular properties of your strings, and more importantly on how many strings you have. For a really huge string database, nobody would care about even a terabyte of mapping overhead if it improves the search speed by a gargantuan factor and removes 99.99% of the iterating.

Upvotes: 2

Prem KTiw

Reputation: 605

Trie is the best solution for you. I say this because you don't have many strings, so going this way would be better. You can look at my implementation of a trie at my GitHub link:
https://github.com/prem-ktiw/Algorithmic-Codes/blob/master/Trie/char_trie.cpp
The code is well commented and will allow you to insert a string in linear time, and search as well in linear time. No collision issues as seen in hashing.
Dynamic allocation has been used so memory will not be an issue.
The only thing is that you cannot have multiple duplicate copies of the same string in my implementation, and there is no record of how many copies of a string are in the trie.
I would like to hear from you on this, in case any assistance is needed.
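For illustration, a minimal lowercase-ASCII trie along these lines (a sketch, not the linked implementation) could look like:

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal trie sketch; assumes strings contain only 'a'..'z'.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> next;
    bool terminal = false;  // true if a string ends at this node
};

struct Trie {
    TrieNode root;

    // O(length) insert: walk/create one node per character.
    void insert(const std::string& s) {
        TrieNode* n = &root;
        for (char c : s) {
            auto& slot = n->next[c - 'a'];
            if (!slot) slot = std::make_unique<TrieNode>();
            n = slot.get();
        }
        n->terminal = true;
    }

    // O(length) search: fail as soon as a child is missing.
    bool contains(const std::string& s) const {
        const TrieNode* n = &root;
        for (char c : s) {
            const auto& slot = n->next[c - 'a'];
            if (!slot) return false;
            n = slot.get();
        }
        return n->terminal;
    }
};
```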

Upvotes: 1

BeeOnRope

Reputation: 64925

Assuming you are talking about "full sized" CPUs1, a binary search through strings, even with only 100 elements, is likely to be quite slow, relative to other solutions at least. You may suffer several branch mispredicts for every search, and will probably end up examining each character in the input string several times (since you need to repeatedly strcmp at each node in the binary search).

As someone already pointed out, the only real way to know is to measure - but to do that you still need to be able to figure out what the candidates are in the first place! Furthermore, it's not always possible to measure in a realistic scenario, since you might not even know such a scenario (imagine, for example, designing a library function which will be widely used across many different cases).

Finally, understanding what is likely to be fast lets you both eliminate candidates you know will perform badly, and allows you to double-check your test results with your intuition: if something is much slower than you expected, it's worth checking why (did the compiler do something stupid), and if something is much faster then maybe it's time to update your intuition.

So I'll try to actually take a stab at what's going to be fast - assuming speed really matters here, and you can spend some time validating a complex solution. As a baseline, a straightforward implementation will probably take 100 ns, and a really optimized one perhaps 10 ns. So if you spend 10 hours of engineering time on this, you'll have to call this function 400 billion times just to earn your 10 hours back5. When you factor in the risk of bugs, the maintenance complexity and other overheads, you are going to want to make sure you are calling this function many trillions of times before trying to optimize it. Such functions are rare, but they certainly exist4.

That said, you are missing a lot of information that would be needed to help design a very fast solution, such as:

  1. Is your input to the search function a std::string or const char * or something else?
  2. What is the average and maximum string length?
  3. Will most of your searches be successful or unsuccessful?
  4. Can you accept some false positives?
  5. Is the set of strings known at compile time, or are you OK with a long initialization phase?

The answers to above can help you partition the design space as described below.

Bloom Filters

If per (4) you can accept a (controllable) number of false positives2, or per (3) most of your searches will be unsuccessful, then you should consider a Bloom Filter. For example, you could use a 1024 bit (128 byte) filter, and use a 60-bit hash of the string to index into it with 6 10-bit functions. This gives a < 1% false positive rate.

This has the advantage that outside of the hash calculation it's independent of the lengths of the strings, and it doesn't depend on the matching behavior (e.g., a search that relies on repeated string comparison will be slower if the strings tend to have long common prefixes).

If you can accept false positives, you are done - but in the case that you need it to be always correct but expect mostly unsuccessful searches, you use it as a filter: if the bloom filter returns false (the usual case) you are done, but if it returns true, you need to double check in one of the always-correct structures discussed below. So the common case is fast, but the right answer is always returned.
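A sketch of this filter-then-verify pattern, using the 1024-bit/6-probe parameters above, with std::hash as a stand-in for a proper 60-bit hash (all names here are illustrative):

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// 1024-bit Bloom filter: six 10-bit indices carved out of one hash value.
// std::hash is only a stand-in; a real 60-bit string hash would be better.
struct BloomFilter {
    std::bitset<1024> bits;

    static uint64_t mix(const std::string& s) {
        return std::hash<std::string>{}(s);
    }

    void add(const std::string& s) {
        uint64_t h = mix(s);
        for (int i = 0; i < 6; ++i)
            bits.set((h >> (10 * i)) & 1023);
    }

    bool maybeContains(const std::string& s) const {
        uint64_t h = mix(s);
        for (int i = 0; i < 6; ++i)
            if (!bits.test((h >> (10 * i)) & 1023)) return false;
        return true;  // possibly a false positive
    }
};

// Filter-then-verify: the filter rejects most misses fast; a sorted vector
// gives the always-correct answer for the rare "maybe" case.
struct ExactSet {
    BloomFilter bloom;
    std::vector<std::string> sorted;

    void build(std::vector<std::string> strings) {
        sorted = std::move(strings);
        std::sort(sorted.begin(), sorted.end());
        for (const auto& s : sorted) bloom.add(s);
    }

    bool contains(const std::string& s) const {
        if (!bloom.maybeContains(s)) return false;  // common fast path
        return std::binary_search(sorted.begin(), sorted.end(), s);
    }
};
```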

Perfect Hash

If the set of ~100 strings is known at compile time, or you are OK doing some one-time heavy work to pre-process the strings, you could consider a perfect hash. If you have a compile-time known search set, you can just slap the strings into gperf and it will spit out a hash function and lookup table.

For example, I just fed 100 random English words3 into gperf and it generated a hash function that only needs to look at two characters to uniquely distinguish every word, like this:

static unsigned int hash (const char *str, unsigned int len)
{
  static unsigned char asso_values[] =
    {
      115, 115, 115, 115, 115,  81,  48,   1,  77,  72,
      115,  38,  81, 115, 115,   0,  73,  40,  44, 115,
       32, 115,  41,  14,   3, 115, 115,  30, 115, 115,
      115, 115, 115, 115, 115, 115, 115,  16,  18,   4,
       31,  55,  13,  74,  51,  44,  32,  20,   4,  28,
       45,   4,  19,  64,  34,   0,  21,   9,  40,  70,
       16,   0, 115, 115, 115, 115, 115, 115, 115, 115,
      /* most of the table omitted */
    };
  register int hval = len;

  switch (hval)
    {
      default:
        hval += asso_values[(unsigned char)str[3]+1];
      /*FALLTHROUGH*/
      case 3:
      case 2:
      case 1:
        hval += asso_values[(unsigned char)str[0]];
        break;
    }
  return hval;
}

Now your hash function is quick and probably well-predicted (if you don't have too many strings of length 3 or less). To look up a string then you just index into the hash table (also generated by gperf), and compare what you get to the input string.

Under some reasonable assumptions this is going to be about as fast as you can get - clang generates code like this:

in_word_set:                            # @in_word_set
        push    rbx
        lea     eax, [rsi - 3]
        xor     ebx, ebx
        cmp     eax, 19
        ja      .LBB0_7
        lea     ecx, [rsi - 1]
        mov     eax, 3
        cmp     ecx, 3
        jb      .LBB0_3
        movzx   eax, byte ptr [rdi + 3]
        movzx   eax, byte ptr [rax + hash.asso_values+1]
        add     eax, esi
.LBB0_3:
        movzx   ecx, byte ptr [rdi]
        movzx   edx, byte ptr [rcx + hash.asso_values]
        cdqe
        add     rax, rdx
        cmp     eax, 114
        ja      .LBB0_6
        mov     rbx, qword ptr [8*rax + in_word_set.wordlist]
        cmp     cl, byte ptr [rbx]
        jne     .LBB0_6
        add     rdi, 1
        lea     rsi, [rbx + 1]
        call    strcmp
        test    eax, eax
        je      .LBB0_7
.LBB0_6:
        xor     ebx, ebx
.LBB0_7:
        mov     rax, rbx
        pop     rbx
        ret

It's a ton of code, but with a reasonable amount of ILP. The critical path is through the 3 dependent memory accesses (look up char value in str -> look up hash value for char in the hash function table -> look up the string in the actual hash table), you could expect this to take perhaps 20 cycles typically (plus the strcmp time of course).

Trie

The "classic" compsci solution to this problem is the trie. The trie could be a reasonable approach for your problem, especially if many unsuccessful matches can be rejected quickly within the first few characters (this depends largely on the content of the match set and the strings you are checking).

You'd want a fast trie implementation to make this work. Overall I feel this approach will be limited by the serially dependent memory accesses - each node is likely visited in a kind of pointer-chasing approach, so you'll suffer a lot from L1 access latency.

Optimizing strcmp

Almost all of the above solutions depend on strcmp at some point - the exception being the bloom filter, which allows false positives. So you want to make sure this part of your code is fast.

In particular, compilers may sometimes inline "builtin" versions of strcmp rather than calling the library function: in a quick test, icc did the inlining, but clang and gcc opted to call the library function. There is no simple rule for which one is going to be faster, but in general the library routines are often SIMD-optimized and may be faster for long strings, while the inlined versions avoid function call overhead and may be faster for short strings. You can test both approaches and mostly force the compilers to do what is faster in your case.

Even better, you may be able to take advantage of your control of the inputs to do much better - if you can ensure that, for example, the input strings will be null padded such that its length is a multiple of 8, then you can do the same for the reference strings in your hash table (or whatever other structure) and you could compare the strings 8 bytes at a time. This not only greatly speeds up the matching, it cuts back dramatically on branch mispredicts because it essentially quantizes the looping behavior (all strings of 1-8 characters loop once, etc).
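A sketch of that 8-bytes-at-a-time comparison, assuming both strings have been null-padded so their buffers are a multiple of 8 bytes (the helper name is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Compare two strings whose buffers are both null-padded to a multiple of
// 8 bytes; padded_len is that common padded length (an assumption of this
// sketch - the caller must guarantee it).
bool equal8(const char* a, const char* b, std::size_t padded_len) {
    for (std::size_t i = 0; i < padded_len; i += 8) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, 8);  // memcpy avoids alignment/aliasing UB
        std::memcpy(&wb, b + i, 8);
        if (wa != wb) return false;
    }
    return true;
}
```

Because the loop count depends only on the padded length, all strings of 1-8 characters take exactly one iteration, 9-16 take two, and so on - which is the quantized looping behavior described above.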


1 Here I mean desktop, server, laptop CPUs, or even modern smartphone CPUs and not embedded device MCUs or anything like that.

2 Allowing false positives means that it's OK if your "is in set" sometimes returns true even when the input string is not in the set. Note that it never gets it wrong the other way around: it always returns true when the string is in the set - there are no false negatives.

3 Specifically, awk 'NR%990==0' /usr/share/dict/american-english > words.

4 For example, how many times do you think strcmp has been called in the history of computing? How much time would be saved if it were even 1 ns faster?

5 That's kind of somehow equating CPU time with engineering time, which is probably off by a factor of more than 1000x: Amazon AWS charges something like $0.02 per hour of CPU time, and a good engineer can expect perhaps $50 per hour (in the first world). So by that (very rough!) metric engineering time is 2500x more valuable than CPU time. So perhaps you need quadrillions of calls for 10 hours of work to pay off...

Upvotes: 19

shawn

Reputation: 460

This is an interesting question because it's very close to the concept of the Java String Pool. Java uses JNI to call the corresponding native method, which is implemented in C++.

The string pool is the JVM's particular implementation of the concept of string interning:

In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.

Let's see how the String pool is implemented in Java 7:

/** 
 * Returns a canonical representation for the string object. 
 * <p> 
 * A pool of strings, initially empty, is maintained privately by the 
 * class <code>String</code>. 
 * <p> 
 * When the intern method is invoked, if the pool already contains a 
 * string equal to this <code>String</code> object as determined by 
 * the {@link #equals(Object)} method, then the string from the pool is 
 * returned. Otherwise, this <code>String</code> object is added to the 
 * pool and a reference to this <code>String</code> object is returned. 
 * <p> 
 * It follows that for any two strings <code>s</code> and <code>t</code>, 
 * <code>s.intern()&nbsp;==&nbsp;t.intern()</code> is <code>true</code> 
 * if and only if <code>s.equals(t)</code> is <code>true</code>. 
 * <p> 
 * All literal strings and string-valued constant expressions are 
 * interned. String literals are defined in section 3.10.5 of the 
 * <cite>The Java&trade; Language Specification</cite>. 
 * 
 * @return  a string that has the same contents as this string, but is 
 *          guaranteed to be from a pool of unique strings. 
 */  
public native String intern();

When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to it is returned.

Java uses JNI to call the native StringTable.intern method, which is implemented in C++:

\openjdk7\jdk\src\share\native\java\lang\String.c

Java_java_lang_String_intern(JNIEnv *env, jobject this)  
{  
    return JVM_InternString(env, this);  
}

\openjdk7\hotspot\src\share\vm\prims\jvm.h

/* 
* java.lang.String 
*/  
JNIEXPORT jstring JNICALL  
JVM_InternString(JNIEnv *env, jstring str); 

\openjdk7\hotspot\src\share\vm\prims\jvm.cpp

// String support ///////////////////////////////////////////////////////////////////////////  
JVM_ENTRY(jstring, JVM_InternString(JNIEnv *env, jstring str))  
  JVMWrapper("JVM_InternString");  
  JvmtiVMObjectAllocEventCollector oam;  
  if (str == NULL) return NULL;  
  oop string = JNIHandles::resolve_non_null(str);  
  oop result = StringTable::intern(string, CHECK_NULL);
  return (jstring) JNIHandles::make_local(env, result);  
JVM_END

\openjdk7\hotspot\src\share\vm\classfile\symbolTable.cpp

oop StringTable::intern(Handle string_or_null, jchar* name,  
                        int len, TRAPS) {  
  unsigned int hashValue = java_lang_String::hash_string(name, len);  
  int index = the_table()->hash_to_index(hashValue);  
  oop string = the_table()->lookup(index, name, len, hashValue);  
  // Found  
  if (string != NULL) return string;  
  // Otherwise, add to symbol to table  
  return the_table()->basic_add(index, string_or_null, name, len,  
                                hashValue, CHECK_NULL);  
}

\openjdk7\hotspot\src\share\vm\classfile\symbolTable.cpp

oop StringTable::lookup(int index, jchar* name,  
                        int len, unsigned int hash) {  
  for (HashtableEntry<oop>* l = bucket(index); l != NULL; l = l->next()) {  
    if (l->hash() == hash) {  
      if (java_lang_String::equals(l->literal(), name, len)) {  
        return l->literal();  
      }  
    }  
  }  
  return NULL;  
}

If you want to know more about how Oracle engineers changed the string pooling logic in Java 7, this link will be useful to you: Bug Report: make the string table size configurable. The String pool is implemented as a fixed-capacity hash map, with each bucket containing a list of Strings with the same hash code. The default pool size is 1009.

For your question, you can write a test program comparing this method against other data structures to determine which one is better.

Upvotes: 3

Cybercartel

Reputation: 12592

The question is a bit vague, but the fastest string matching algorithm is a finite state machine, i.e. the Aho–Corasick algorithm. It's the generalization of the Knuth–Morris–Pratt matching algorithm. If you just want a simple lookup you can try a ternary trie, or a compressed trie (radix tree) if space matters, or even binary search.

Upvotes: 2

Alexandre C.

Reputation: 56956

Use std::unordered_set<std::string>, which is well suited for your case. You can have a std::set<std::string> around if you also need to iterate them in order.

If after profiling you find out you spend all your time querying the data structure, then it will be the time to ask another question (with the precise code you'll be using).
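In code, the suggestion is about this simple (a sketch, contents illustrative):

```cpp
#include <string>
#include <unordered_set>

// Build the set once from the known strings...
std::unordered_set<std::string> make_set() {
    return {"alpha", "beta", "gamma"};
}

// ...then each membership query is an average O(1) hash lookup.
bool exists(const std::unordered_set<std::string>& s, const std::string& q) {
    return s.count(q) != 0;
}
```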

Upvotes: 1

Striezel

Reputation: 3758

The best (and only) way to tell what structure is fastest for a certain situation is to actually benchmark/measure it with different data structures. Then pick the fastest.

Or in other words: Measuring your code gives you an advantage over those people who think they are too smart to measure. ;)

For rather small lists like the 100 elements you mentioned in your question, it does not make much of a difference which structure/algorithm you use, because the time gained is probably negligible - unless that search is performed very often by your program.
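A minimal timing harness along these lines (illustrative; run the same loop once per candidate structure and compare the numbers):

```cpp
#include <chrono>
#include <string>
#include <unordered_set>
#include <vector>

// Run `reps` passes of lookups and return the hit count; returning the
// count keeps the compiler from optimizing the loop away.
std::size_t run_lookups(const std::unordered_set<std::string>& set,
                        const std::vector<std::string>& queries, int reps) {
    std::size_t hits = 0;
    for (int r = 0; r < reps; ++r)
        for (const auto& q : queries)
            hits += set.count(q);
    return hits;
}

// Wall-clock nanoseconds for the whole run; divide by reps * queries.size()
// for a rough per-lookup figure.
long long time_lookups(const std::unordered_set<std::string>& set,
                       const std::vector<std::string>& queries, int reps,
                       std::size_t* hits_out) {
    auto t0 = std::chrono::steady_clock::now();
    std::size_t hits = run_lookups(set, queries, reps);
    auto t1 = std::chrono::steady_clock::now();
    if (hits_out) *hits_out = hits;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```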

Upvotes: 15

Dale Wilson

Reputation: 9434

Unless you are doing hundreds of millions of searches a second, you won't be able to tell the difference. If you ARE doing hundreds of millions of searches a second, try a radix-tree. It's very expensive in memory but with this small data set that shouldn't matter.

Once you write it, profile it.

Upvotes: 2
