Reputation: 786
I currently have this kind of loop:

while (1)
{
    generate_string(&buffer);

    for (int i = 0; i < filelines; i++)
    {
        if (strcmp(buffer, line[i]) == 0)
        {
            /* do something */
        }
    }
}
I have a file with a few million strings (which hopefully should be cut in half sometime soon); the total number of strings is stored in filelines.
line[i] is basically where the string itself is stored.
Currently, because of the comparison against those millions of strings, generate_string(&buffer) only gets executed around 42 times per second. Is there a faster way to do string comparison in C?
Upvotes: 11
Views: 31996
Reputation: 111130
strcmp is usually optimized by all vendors. However, if you're not satisfied with this you can look at other approaches: libc used to have an optimization for small strings, where it tested strings smaller than five bytes as integers. MS cl also has some optimizations for small strings (do look it up). But more importantly, make sure strcmp is your real bottleneck.
Upvotes: 17
Reputation: 51
The following trick is from Optimization of Computer Programs in C: you can save a little time by checking the first characters of the strings in question before doing the call. Obviously, if the first characters differ, there's no reason to call strcmp to check the rest. Because of the non-uniform distribution of letters in natural languages, the payoff is not 26:1 but more like 15:1 for uppercase data.
#define QUICKIE_STRCMP(a, b) (*(a) != *(b) ?              \
                              (int) ((unsigned char) *(a) - \
                                     (unsigned char) *(b)) : \
                              strcmp((a), (b)))
If the dictionary of words you are using is well defined (meaning you don't care about the return value from strcmp other than 0 == equal), for example a set of command-line arguments that begin with the same prefix, e.g. tcp-accept and tcp-reject, then you can rewrite the macro and do some pointer arithmetic to compare not the 1st but the Nth char, in this case the char just past the common prefix (offset 4), ex:

#define QUICKIE_STRCMP(a, b, offset) \
    (*((a) + (offset)) != *((b) + (offset)) ? -1 : strcmp((a), (b)))
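A hypothetical usage, assuming every candidate string shares the 4-byte tcp- prefix (arg is just an illustrative name):

/* offset 4 is the first character after "tcp-", where the candidates diverge */
if (QUICKIE_STRCMP(arg, "tcp-accept", 4) == 0)
{
    /* matched tcp-accept */
}

Note that with this variant a non-zero result only means "not equal"; it no longer gives a meaningful lexicographic ordering.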
Upvotes: 5
Reputation: 173
You can use a byte-wise comparator macro instead of strcmp() to achieve a very fast string comparison (of standard 8-bit char) if you know the string length beforehand. I benchmarked the byte-wise comparator macro against glibc's strcmp(), and the macro version significantly outperformed the strcmp() implementation; it takes advantage of the CPU's vector processor.
Example:
#include <stdbool.h>

/* compares the first 4 bytes of x against y0..y3 (the name counts from 0) */
#define str3_cmp(x, y0, y1, y2, y3) \
    ((x)[0] == (y0) && (x)[1] == (y1) && (x)[2] == (y2) && (x)[3] == (y3))

static inline bool str3_cmp_helper(const char *x, const char *y) {
    return str3_cmp(x, *y, *(y + 1), *(y + 2), *(y + 3));
}

const char *i = "hola"; // dynamically generated (e.g. received over a network)
if (str3_cmp_helper(i, "hola")) {
    /* do something */
} else {
    /* do something else */
}
However, writing such macros by hand is tiresome, so I have included a PHP script to generate them. It takes the string lengths to be compared (this argument is variadic, so pass as many lengths as you want) followed by the output filename as its last argument.
#!/usr/bin/php
<?php
// Emits a str<N>cmp_macro(ptr, c0, ..., c<N-1>) #define for one length.
function generate_macro($num) : string {
    $returner = "#define str".$num."cmp_macro(ptr, ";
    for($x = 0; $x < $num; $x++){
        $returner .= "c".$x;
        if($x != $num-1){ $returner .= ", "; }
    }
    $returner .= ") ";
    for($x = 0; $x < $num; $x++){
        $returner .= "*(ptr+".$x.") == c".$x;
        if($x != $num-1){ $returner .= " && "; }
    }
    return $returner;
}

// Appends a static inline str<N>cmp(ptr, cmp) wrapper around the macro.
function generate_static_inline_fn(&$generated_macro, $num) : string {
    $generated_macro .= "static inline bool str".$num."cmp(const char* ptr, const char* cmp)".
                        "{\n\t\treturn str".$num."cmp_macro(ptr, ";
    for($x = 0; $x < $num; $x++){
        $generated_macro .= " *(cmp+".$x.")";
        if($x != $num-1){ $generated_macro .= ", "; }
    }
    $generated_macro .= ");\n}\n";
    return $generated_macro;
}

// argv[1..argc-2] are the lengths; the last argument is the output file.
function handle_generation($argc, $argv) : void {
    $out_filename = $argv[$argc-1];
    $gen_macro = "";
    for($x = 0; $x < $argc-2; $x++){
        $macro = generate_macro($argv[$x+1])."\n";
        $gen_macro .= generate_static_inline_fn($macro, $argv[$x+1]);
    }
    file_put_contents($out_filename, $gen_macro);
}

handle_generation($argc, $argv);
?>
Script example: $ ./gen_faststrcmp.php 3 5 fast_strcmp.h
This generates fast_strcmp.h with macros for comparing strings of length 3 and 5:
#define str3cmp_macro(ptr, c0, c1, c2) *(ptr+0) == c0 && *(ptr+1) == c1 && *(ptr+2) == c2
static inline bool str3cmp(const char* ptr, const char* cmp){
    return str3cmp_macro(ptr, *(cmp+0), *(cmp+1), *(cmp+2));
}

#define str5cmp_macro(ptr, c0, c1, c2, c3, c4) *(ptr+0) == c0 && *(ptr+1) == c1 && *(ptr+2) == c2 && *(ptr+3) == c3 && *(ptr+4) == c4
static inline bool str5cmp(const char* ptr, const char* cmp){
    return str5cmp_macro(ptr, *(cmp+0), *(cmp+1), *(cmp+2), *(cmp+3), *(cmp+4));
}
You can use the macro like so:
const char* compare_me = "Hello";
if(str5cmp(compare_me, "Hello")) { /* code goes here */ }
Upvotes: 1
Reputation:
Use strcmp for regular strings. But if the string is really long you can use memcmp. It compares chunks of memory, so you need to know the length in advance.
Upvotes: 0
Reputation: 23
It depends on the length of the string.
If it's not too long, you can try to compare byte by byte:
str[0] == str2[0] && str[1] == str2[1] && str[2] == str2[2]
Otherwise, use memcmp()
, it compares chunks of memory.
Upvotes: 0
Reputation: 279255
You're already compiling with optimization, right?
If you have a Trie or hashtable data structure lying around the place, ready to use, then you should.
Failing that, a fairly easy change that will probably speed things up is to sort your array line once, before you start generating strings to search for. Then binary search for buffer in the sorted array. It's easy because the two functions you need are standard -- qsort and bsearch.
A binary search into a sorted array only needs to do about log2(filelines) string comparisons, instead of about filelines. So in your case that's 20-something string comparisons per call to generate_string instead of a few million. From the figures you've given, I think you can reasonably expect it to go 20-25 times faster, although I promise nothing.
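A minimal sketch of that approach, assuming line is a char *line[] and filelines its length, as in the question:

#include <stdlib.h>
#include <string.h>

/* qsort/bsearch pass each element's address to the comparator,
   so for an array of char* we compare through a pointer-to-pointer. */
static int cmp_strptr(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* once, after loading the file */
void sort_lines(char *line[], size_t filelines)
{
    qsort(line, filelines, sizeof line[0], cmp_strptr);
}

/* per generated string: returns nonzero if buffer is present */
int line_present(char *line[], size_t filelines, const char *buffer)
{
    char **hit = bsearch(&buffer, line, filelines, sizeof line[0], cmp_strptr);
    return hit != NULL;
}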
Upvotes: 1
Reputation: 9422
I can assure you, the function strcmp is ABSOLUTELY NOT the bottleneck. Typically, strcmp is well optimized and can do 32- or 64-bit comparisons for strings longer than 4/8 bytes depending on architecture. Both newlib and GNU libc do this. But even if you were to look at each byte in both strings 20 times, it doesn't matter as much as the algo & data structure choices made here.
The real bottleneck is the O(N) search algorithm. A single O(N log N) pass over the file could be used to build an appropriate data structure (whether it's a normal BST, a trie, or just a simple sorted array) for doing O(log N) lookups.
Bear with me here--a lot of math follows. But I think this is a good opportunity to illustrate why the choice of algorithm & data structure is sometimes FAR more important than the method of string comparison. Steve touches on this, but I wanted to explain it in a little more depth.
With N=1e6, log2(1e6) = 19.9, so round up to 20 comparisons on an ideal data structure.
Currently you're doing a worst-case search of O(N), or 1e6 operations.
So say you just build a red-black tree with O(log N) insertion time, and you insert N items; that's O(N log N) time to build the tree. That's 1e6 x 20, or 20e6 operations, necessary to build your tree.
In your current approach, building the data structure is O(N), or 1e6 operations, but your worst-case search time is O(N) as well. So by the time you read the file and do just 20 search operations, you're up to a theoretical worst case of 21,000,000 operations (1e6 to build + 20 x 1e6 to search). By comparison, your worst case with a red-black tree and 20 searches is 20,000,400 operations, or 999,600 operations BETTER than the O(N) search on an unsorted array. So at 20 searches, you're at the first point where a more sophisticated data structure really pays off. But look at what happens at 1000 searches:
Unsorted array = initialization + 1000 x search time = O(N) + 1000 * O(N) = 1,000,000 + 1,000,000,000 = 1,001,000,000 operations.
Red-black = initialization + 1000 x search time = O(N log N) + 1000 * O(log N) = 20,000,000 + 20,000 = 20,020,000 operations.
1,001,000,000 / 20,020,000 ~= 50x as many operations for the O(N) search.
At 1e6 searches, that's (1e6 + 1e6 * 1e6) / (20e6 + 1e6 * 20) = 25,000x as many operations.
Assume your computer can handle the 40e6 'operations' it takes to do the 1e6 log N searches in 1 minute. It would take 25,000 minutes, or 17 DAYS, to do the same work with your current algorithm. Or another way to look at it: the O(N) search algorithm can only handle 39 searches in the time the O(log N) algorithm can do 1,000,000. And the more searches you do, the uglier it gets.
See the responses from Steve and dirkgently for several better choices of data structures & algorithms. My only additional caution would be that the qsort() suggested by Steve might have a worst-case complexity of O(N*N), which is far, far worse than the O(N log N) you get with a heapsort or various tree-like structures.
Upvotes: 6
Reputation: 70929
If I get your question correctly, you need to check whether a string is among all the lines read so far. I would propose using a trie, or even better a Patricia tree, built from the file's lines. This way, instead of going over all the lines, you can check in time linear in the string's length whether your string is present (and with a little more effort, where).
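A minimal sketch of the trie idea, assuming plain 8-bit strings; the 256-way child array is the simplest (and most memory-hungry) layout, which a Patricia tree would compress:

#include <stdlib.h>

typedef struct trie_node {
    struct trie_node *child[256]; /* one slot per possible byte value */
    int terminal;                 /* nonzero if a stored string ends here */
} trie_node;

/* insert s, allocating the root on first use; returns the (possibly new) root */
static trie_node *trie_insert(trie_node *root, const char *s)
{
    if (root == NULL)
        root = calloc(1, sizeof *root);
    trie_node *n = root;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if (n->child[c] == NULL)
            n->child[c] = calloc(1, sizeof *n->child[c]);
        n = n->child[c];
    }
    n->terminal = 1;
    return root;
}

/* walk the string; every step is one array index, no strcmp at all */
static int trie_contains(const trie_node *root, const char *s)
{
    const trie_node *n = root;
    for (; n && *s; s++)
        n = n->child[(unsigned char)*s];
    return n != NULL && n->terminal;
}

You would insert each line[i] once at load time, then call trie_contains(root, buffer) per generated string.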
Upvotes: 2
Reputation: 104698
You may be able to get by with a binary comparison in this case, because your program does not actually sort, it only compares for equality.
You can also improve comparison speeds here by determining the lengths in advance (provided, of course, they vary enough). When the lengths don't match, the /* do something */ will not happen; see the sketch below.
Of course, hashing would be another consideration here, depending on how many times you look up the hashed value.
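A sketch of that length pre-check, assuming a hypothetical linelen[] array filled in once when the file is loaded:

#include <string.h>

/* once, at load time: linelen[i] = strlen(line[i]); */
void search(char *line[], const size_t linelen[], int filelines, const char *buffer)
{
    size_t buflen = strlen(buffer);
    for (int i = 0; i < filelines; i++)
    {
        /* different lengths => can't be equal; skip the byte comparison */
        if (linelen[i] == buflen && memcmp(line[i], buffer, buflen) == 0)
        {
            /* do something */
        }
    }
}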
Upvotes: 0
Reputation: 2804
You can try something 'cheap' like screening based on the first char. If the first chars don't match, the strings cannot be equal. If they match, then call strcmp to compare the entire string. You may wish to consider a better algorithm if that is appropriate for your situation; examples would be sorting the file's lines and doing a binary search, using a hash table, or similar string-table techniques.
Upvotes: 1
Reputation: 234795
I don't know that there's a faster way than calling strcmp to do string comparisons, but you can perhaps avoid calling strcmp so much. Use a hash table to store your strings, and then you can check whether the string in buffer is in the hash table. If the index of a hit is important when you "do something", the table can map strings to indexes.
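C has no standard hash table, so here is a minimal chained-bucket sketch; the djb2 hash, the bucket count, and all names are illustrative choices, not anything from this answer:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS (1u << 21) /* ~2M buckets for a few million strings; tune to taste */

typedef struct entry {
    const char *str;
    int index;          /* index into line[], for when "do something" needs it */
    struct entry *next;
} entry;

static entry *buckets[NBUCKETS];

/* djb2: a simple, decent string hash */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* once per line at load time */
static void table_insert(const char *s, int idx)
{
    unsigned long b = hash_str(s) & (NBUCKETS - 1);
    entry *e = malloc(sizeof *e);
    e->str = s;
    e->index = idx;
    e->next = buckets[b];
    buckets[b] = e;
}

/* per generated string: returns the entry, or NULL if absent */
static entry *table_find(const char *s)
{
    for (entry *e = buckets[hash_str(s) & (NBUCKETS - 1)]; e != NULL; e = e->next)
        if (strcmp(e->str, s) == 0)
            return e;
    return NULL;
}

With the table built, the inner loop over filelines collapses to a single table_find(buffer), which on average costs only one or two strcmp calls.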
Upvotes: 0