EvilTeach
EvilTeach

Reputation: 28872

What is the fastest way to decide if a digit appears in a string?

This simple solution popped into my mind quickly.

#include <ctype.h>

int digit_exists_in
(
    const char *s
)
{
    while (*s)
    {
        if (isdigit(*s))
        {
            return 1;
        }
        else
        {
            s++;
        }
    }

    return 0;
}

int main(void)
{
    int foundDigit = digit_exists_in("abcdefg9ijklmn");

    return 0;
}

What other techniques could be applied to get better speed?

The actual strings being searched are variable length, and the characters themselves are ASCII, not the full character set. The strings are NUL terminated.

Upvotes: 6

Views: 1823

Answers (17)

rodeone2
rodeone2

Reputation: 109

Suggested fastest way to calculate isdigit is by using a #define method in a header file or above your int main().

#define IS_DIGIT(c)  ((c) >= '0' && (c) <= '9')

Because its been Pre Defined and because all numeric digit characters are numbered in binary numeric order, its already set for your if conditional statement.

if(IS_DIGIT (s)) {....}else{....}

Upvotes: 0

NickZoic
NickZoic

Reputation: 7835

liw.fi is right, by the way. I was a bit surprised by this, since strcspn has to do a much more general problem than the isdigit() approach, but it seems to be the case:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

#define NTESTS 10000
#define TESTSIZE 10000

char stest1[TESTSIZE];
char stest2[TESTSIZE];

int test_isdigit(char *s) {
    while (*s) {
        if (isdigit(*s)) return 1;
        s++;
    }
    return 0;
}

int test_range(char *s) {
    while (*s) {
        if ((*s >= '0') && (*s <= '9')) return 1;
        s++;
    }
    return 0;
}

int test_strcspn(char *s) {
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(int argc, char **argv) {
    long int i;
    for (i=0; i<TESTSIZE; i++) {
        stest1[i] = stest2[i] = 'A' + i % 26;
    }
    stest2[TESTSIZE-1] = '5';

    int alg = atoi(argv[1]);

    switch (alg) {
        case 0:        
            printf("Testing strcspn\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_strcspn(stest1) == 0);
                assert(test_strcspn(stest2) != 0);
            }
            break;
        case 1:
            printf("Testing isdigit() loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_isdigit(stest1) == 0);
                assert(test_isdigit(stest2) != 0);
            }
            break;
        case 2:
            printf("Testing <= => loop\n");
            for (i=0; i<NTESTS; i++) {
                assert(test_range(stest1) == 0);
                assert(test_range(stest2) != 0);
            }
            break;
        default:
            printf("eh?\n");
            exit(1);
    }    

    return 0;
}

It's awfully hard to beat the standard libraries at their own game ... (the usual caveats apply -- YMMV)

$ gcc -O6 -Wall -o strcspn strcspn.c 

$ time ./strcspn 0
Testing strcspn

real    0m0.085s
user    0m0.090s
sys 0m0.000s

$ time ./strcspn 1
Testing isdigit() loop

real    0m0.753s
user    0m0.750s
sys 0m0.000s

$ time ./strcspn 2
Testing <= => loop

real    0m0.247s
user    0m0.250s
sys 0m0.000s

UPDATE: Just for fun, I added a bitmap lookup version based on Mike Dunlavey's answer:

char bitmap[256] = {
        /* 0x00 */ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        /* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
};

int test_bitmap(char *s) {
    while (!bitmap[*(unsigned char *)s]) s++;
    return (*s);
}

Which slightly outperforms the others (~.170s) but still can't touch strcspn!

Upvotes: 19

EvilTeach
EvilTeach

Reputation: 28872

Taking the test program and throwing my profiler at it yields the following.

      Count      %   % with           Time   Statement
                      child
-----------------------------------------------------------------------------------------------
                                             int test_isdigit(char *s)   
     20,000    0.0    0.0          2,160.4   {   
199,990,000   13.2    9.5     14,278,460.7       while (*s)  
                                                 {   
199,980,000   69.8   49.9     75,243,524.7            if (isdigit(*s)) return 1;  
199,970,000   17.0   12.1     18,312,971.5            s++;    
                                                 }   

     10,000    0.0    0.0          1,151.4       return 0;   
                                             }   

                                             int test_range(char *s)     
     20,000    0.0    0.0          1,734.2   {   
199,990,000   33.6    9.4     14,114,309.7       while (*s)  
                                                 {
199,980,000   32.2    9.0     13,534,938.6           if ((*s >= '0') && 
                                                        (*s <= '9')) return 1;   
199,970,000   34.2    9.5     14,367,161.9           s++;    
                                                 }   

     10,000    0.0    0.0          1,122.2       return 0;   
                                             }   

                                             int test_strcspn(char *s)   
     20,000    0.2    0.0          1,816.6   {   
     20,000   99.8    0.6        863,113.2       return s[strcspn(s, "0123456789")] 
                                                          == '0'; 
                                             }   

strcspn does the job well enough. Looking at the asm code for it, I see it builds a bitmap of size 256 sets the bits based on the search characters, then processes the string.

The bit map gets built on the stack once for every call.

One other approach would be to build and keep the bit map, and reused it each time.

Another approach would be to do the operations in parallel using the techniques that Chris Smith talked about.

For the moment strcspn will suffice.

Upvotes: 3

Mike Dunlavey
Mike Dunlavey

Reputation: 40679

Just for fun, maybe something allong the lines of:

 // accumulator
unsigned char thereIsADigit = 0;
// lookup table
unsigned char IsDigit[256] = {0,0,0 ..., 1,1,1,1,1,1,1,1,1,0,0,0 ...};

// an unrolled loop, something like:
thereIsADigit |= IsDigit[s[0]];
thereIsADigit |= IsDigit[s[1]];
thereIsADigit |= IsDigit[s[2]];
thereIsADigit |= IsDigit[s[3]];
thereIsADigit |= IsDigit[s[4]];
thereIsADigit |= IsDigit[s[5]];
thereIsADigit |= IsDigit[s[6]];
thereIsADigit |= IsDigit[s[7]];
if (thereIsADigit) break;
s += 8;

On the IBM 360 there was a "translate" instruction that could do this in one step.

OK, OK, Christopher Smith's answer got me thinking. Suppose you are only using 7-bit ASCII. Here's a way to do SIMD with wide-integer arithmetic.

Suppose C is a 32-bit word containing 4 characters.

 // compare by subtracting in 8-bit 2s complement arithmetic
( (C + ((0x3a3a3a3a ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char <= '9'
  &
  (0x2f2f2f2f + ((C ^ 0x7f7f7f7f) + 0x01010101)) // set high bit for any char >= '0'
) // high bit is set for any char <= '9' and >= '0'
& 0x80808080 // look only at the high-order bits
// if any of these 4 bits is 1, there is a digit in C
// if the result is zero, there are no digits in C

It depends on the high-order bit of each character being initially zero, so that a carry into that bit will not propogate. (I'm sure this could be simplified.)

Upvotes: 2

dwc
dwc

Reputation: 24920

Here's a version that may or may not be faster, but it handles a NULL pointer...

int digit_exists_in(const char *s)
{
    if (!s)
        return (0);
    while (*s)
        if (isdigit(*s++))
            return (1);
    return (0);
}

Upvotes: 1

Zan Lynx
Zan Lynx

Reputation: 54345

Memory Prefetch

If your strings are very long, either get your compiler to do it or manually unroll your loop and drop in a memory prefetch instruction or two every cache-line.

That way while the CPU is scanning, the memory controller can be pulling in the next lines of data.

If you save the length of the string when you create it you can skip all the checks for the NUL byte, which means you can unroll your loop to operate in bigger chunks and reduce the number of compare and branch operations, although with current branch predictors, it honestly doesn't make much difference.

Even with great CPU branch predictors, the loop will slow down if it has to check the loop counter every time through the loop to decide when to do a memory prefetch so unrolling is still helpful in that case.

Profiling Feedback

For best performance the CPU does need to get the branches hinted correctly, and that's where profiling feedback comes in very handy. Otherwise the compiler is just making a somewhat educated guess.

Upvotes: 1

aib
aib

Reputation: 46981

Of course, you can sacrifice accuracy for speed:

int digit_exists_in(const char* s)
{
    return 0;
}

This algorithm has a complexity of O(1) and an approximacity of O((246/256)^N).

Upvotes: 0

user82238
user82238

Reputation:

I may be wrong, but there may be a faster way.

Quicksort the string.

The serial search has a best time of O(1), a mean of O(1/2n) and worse case of O(n).

Quicksort has best O(log n), mean O(nlog n), worse case O(n^2).

The thing is, you can bail out of the quicksort as soon as you see a digit. If the quicksort actually completes, the digit will be at the beginning of the sorted string, so you'll find it then in O(1).

The achievement here it to shift the best, mean and worse case behaviours. A quicksort will have worse worst case behaviour, but better mean behaviour.

Upvotes: 0

Jeff
Jeff

Reputation: 2871

Have a human look at it. Humans can do this in O(1) time, we have MUCH larger word sizes than even modern processors.

That said, actual time would still be better with your method...what with the difference in cycle time between a modern core and a human brain.

Upvotes: 0

Tim
Tim

Reputation: 20360

if you really want to reduce the overhead time and you don't mind making it specific for char, then you can check the ascii values between 0 and 9 inclusive.

48 to 57 decimal

this removes a stack call.

I should have said lookup table as well...

Upvotes: 1

user25148
user25148

Reputation:

I would start by using the appropriate library function, strcspn, on the assumption that the library has been optimized with extreme prejudice:

#include <string.h>
#include <stdio.h>

int digit_exists_in(const char *s)
{
    return s[strcspn(s, "0123456789")] != '\0';
}

int main(void)
{
    printf("%d\n", digit_exists_in("foobar"));
    printf("%d\n", digit_exists_in("foobar1"));
    return 0;
}

If the library hasn't been optimized sufficiently, it'd be a good idea to put the optimization into the library so everyone can benefit. (You have the source, right?)

Upvotes: 13

Brian R. Bondy
Brian R. Bondy

Reputation: 347426

Any algorithm will be O(N).

I assume isdigit is pretty efficient already.

Upvotes: 7

Patrick McDonald
Patrick McDonald

Reputation: 65451

The real question is how important is it for this function to be optimal? I say leave the simple solution and profile it. You should only optimize it if it is causing a bottleneck in your code.

Upvotes: 2

Christopher Smith
Christopher Smith

Reputation: 5542

There isn't a faster algorithm, but you can look at a full register's worth of bytes per instruction or use SIMD operations to speed things up. You can either use a mask and zero test to see if it is even possible if there are any digits in a range, or if your SIMD operations are fast enough on large enough vectors, you could do iterate through tests for specific numerical values in a vector of bytes faster than doing character compares.

So, for example, you could do something like:

byte buffer[8] = { 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30, 0x30 };
uint64 *mask = (uint64 *) buffer; //this is just for clarity

if (*((uint64 *) s) & *mask) == 0)
    //You now don't need to do the < '0' test for the next 8 bytes in s

Some optimizers might be smart enough to do this for you just from your code sample above.

You had better be comparing a TON of bytes to be thinking about optimizations at this level though.

Upvotes: 10

Iraimbilanja
Iraimbilanja

Reputation:

As others have said, you can't get below O(N).

I can think of a contrived scenario with logarithmic speed... Say you're writing a text editor and it needs a "does this file contain any digits" feature. You can keep a sorted array of all unique characters present in the file, update it on every keystroke, and query it with binary search. This probably lies outside the scope of your question though (and can be done in several better ways).

Upvotes: 1

Kibbee
Kibbee

Reputation: 66132

You could go multithreaded, although that's probably adding too much complexity to the algorithm for something that's already pretty fast.

Upvotes: 2

JaredPar
JaredPar

Reputation: 755169

Conceptually there is no faster way. This is assuming that you have a string which where the placement of digits is seemingly random. This forces you to search every item in the string for a digit hence front to back is as likely to find the digit first as any other search mechanism.

Upvotes: 2

Related Questions