user1024718
user1024718

Reputation: 575

Performance on a lookup array in C

I have to iterate over a log file that contains about 100M lines. I have to do this for multiple logs. The average line length is 110 characters.

Currently I'm looping over a list of possible matches. I would like to know if there's a better way of doing this?

char *in_array(char *car) {
    // longer list than this...
    char *carlist[] =
    { 
        "Avalon",
        "Azera",
        "Cayenne",
        "Civic",
        "Corolla",
        "Elantra",
        "F-150",
        "Hilux",
        "Lexus LS",
        "Rav 4",
        "Sienna",
        // etc...
    };  

    char *match;
    int i;
    int n = sizeof(carlist)/sizeof(carlist[0]);

    for(i = 0; i < n; i++)
    {
        match = strstr(car, carlist[i]);
        if(match != NULL)
        {
            return strdup(match);
        } 
    }   

    return strdup("No match");
}

Upvotes: 0

Views: 300

Answers (4)

rici
rici

Reputation: 241671

It seems likely that the bottleneck in your scan will be the cost of reading the data from the logfiles; almost any non-naive search algorithm should perform faster than a disk seek.

All the same, the brute-force algorithm in the OP is unnecessarily CPU-intensive, particularly if the word-list is long. A reasonably simple and highly-efficient algorithm, for which implementations are readily discoverable, is Aho-Corasick string matching.

If the word-list is static, you could speed that up slightly more by precomputing the transition tables for Aho-Corasick, but in this case, the difference will be negligible compared with the cost of reading the data.

Upvotes: 1

Codo
Codo

Reputation: 78815

Assuming that all your car names match at the beginning of word, you could parse each line and try to match against the array on each word boundary (using bsearch()). That could be faster though you'll need to measure it.

static int
compare_car(const void *key, const void *elem)
{
    char *car = (char *)key;
    char *text = * (char **)elem;
    int len = strlen(key);
    return strncmp(car, text, len);
}

char *in_array(char *car) {
    // longer list than this... NEEDS TO BE SORTED
    char *carlist[] =
    { 
        "Avalon",
        "Azera",
        "Cayenne",
        "Civic",
        "Corolla",
        "Elantra",
        "F-150",
        "Hilux",
        "Lexus LS",
        "Rav 4",
        "Sienna",
        // etc...
    };  

    char *match;
    int i;
    int n = sizeof(carlist)/sizeof(carlist[0]);
    int len = strlen(car);
    bool inword = FALSE;

    for (i = 0; i < len; i++)
    {
        if (inword)
        {
            inword = isalnum(car[i]);
        }
        else
        {
            if (isalnum(car[i]))
            {
                inword = true;
                match = bsearch(car + i, carlist, n, sizeof(char*), compare_car);
                if (match != NULL)
                {
                    return strdup(match);
                }
            }
        }
    }

    return strdup("No match");
}

Upvotes: 0

Tim Child
Tim Child

Reputation: 3012

There are a couple of ways to improve this brute force search...

1) Parallelism, use OpenCL/CUDA ( C like languages) and do it on a GPU. Use a GPU thread per keyword. Read parts of the 100M line file into GPU memory and search in parallel. With OpenCL, that can run on multi-core H/W too.

2) Improve the algorithm. Use a hash look-up, compute the CRC32 of the keywords store them in a hash table and compare the hashed token from log file records to entries in the hash table.

Upvotes: 1

Ben Voigt
Ben Voigt

Reputation: 283614

A DFA (Deterministic Finite Automaton, plural Automata) can match a string against many patterns much more quickly than sequentially testing each pattern independently.

Parser generators are good at building DFA tables automatically.

The downside is that the DFA has to be rebuilt when the list of patterns changes... but since your list of patterns is currently hardcoded as an array, that shouldn't be problematic for you.

Since you are using C, flex would be a good tool to use.

Upvotes: 5

Related Questions