Reputation: 575
I have to iterate over a log file that contains about 100M lines, and I have to do this for multiple logs. The average line length is 110 characters.
Currently I'm looping over a list of possible matches. I would like to know if there's a better way of doing this?
#include <string.h>   /* strstr(), strdup() (POSIX) */

char *in_array(char *car) {
    // longer list than this...
    char *carlist[] =
    {
        "Avalon",
        "Azera",
        "Cayenne",
        "Civic",
        "Corolla",
        "Elantra",
        "F-150",
        "Hilux",
        "Lexus LS",
        "Rav 4",
        "Sienna",
        // etc...
    };
    char *match;
    int i;
    int n = sizeof(carlist) / sizeof(carlist[0]);
    for (i = 0; i < n; i++)
    {
        match = strstr(car, carlist[i]);
        if (match != NULL)
        {
            /* copy from the match to the end of the line */
            return strdup(match);
        }
    }
    return strdup("No match");
}
Upvotes: 0
Views: 300
Reputation: 241671
It seems likely that the bottleneck in your scan will be the cost of reading the data from the logfiles; almost any non-naive search algorithm will outpace the disk I/O.
All the same, the brute-force algorithm in the OP is unnecessarily CPU-intensive, particularly if the word list is long. A reasonably simple and highly efficient algorithm, for which implementations are readily discoverable, is Aho-Corasick string matching.
If the word list is static, you could speed that up slightly more by precomputing the transition tables for Aho-Corasick, but in this case the difference will be negligible compared with the cost of reading the data.
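As a rough illustration, here is a minimal Aho-Corasick sketch in C. The fixed node pool, the ASCII-only alphabet, and the ac_* names are assumptions made for this example; a real implementation would size the trie dynamically:

#include <stdio.h>
#include <string.h>

#define ALPHA 128            /* ASCII alphabet (assumption) */
#define MAXNODES 4096        /* generous for a small word list */

static int trans[MAXNODES][ALPHA]; /* trie / transition table */
static int fail[MAXNODES];         /* failure links */
static int out[MAXNODES];          /* matched word id, or -1 */
static int nnodes = 1;             /* node 0 is the root */

static void ac_init(void) {
    memset(out, -1, sizeof out);   /* all-ones bytes == -1 for int */
}

static void ac_add(const char *word, int id) {
    int s = 0;
    for (; *word; word++) {
        int c = (unsigned char)*word % ALPHA;
        if (!trans[s][c])
            trans[s][c] = nnodes++;
        s = trans[s][c];
    }
    out[s] = id;
}

static void ac_build(void) {
    /* BFS over the trie: compute failure links and fill in missing
       transitions so the result is a plain DFA. */
    int queue[MAXNODES], head = 0, tail = 0;
    int c;
    for (c = 0; c < ALPHA; c++)
        if (trans[0][c])
            queue[tail++] = trans[0][c];
    while (head < tail) {
        int s = queue[head++];
        for (c = 0; c < ALPHA; c++) {
            int t = trans[s][c];
            if (t) {
                fail[t] = trans[fail[s]][c];
                if (out[t] < 0)
                    out[t] = out[fail[t]];  /* inherit shorter matches */
                queue[tail++] = t;
            } else {
                trans[s][c] = trans[fail[s]][c];
            }
        }
    }
}

/* One pass over the line; returns the id of the first hit, or -1. */
static int ac_match(const char *line) {
    int s = 0;
    for (; *line; line++) {
        s = trans[s][(unsigned char)*line % ALPHA];
        if (out[s] >= 0)
            return out[s];
    }
    return -1;
}

A hypothetical driver, with the matched id indexing back into the word list:

int main(void) {
    const char *carlist[] = { "Avalon", "Civic", "Corolla" };
    int i, id;
    ac_init();
    for (i = 0; i < 3; i++)
        ac_add(carlist[i], i);
    ac_build();
    id = ac_match("2014 Honda Civic LX, 41k miles");
    printf("%s\n", id >= 0 ? carlist[id] : "No match");
    return 0;
}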
Upvotes: 1
Reputation: 78815
Assuming that all your car names match at the beginning of a word, you could parse each line and try to match against the array at each word boundary (using bsearch()). That could be faster, though you'll need to measure it.
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

static int
compare_car(const void *key, const void *elem)
{
    const char *text = key;                 /* rest of the line */
    const char *name = *(const char **)elem;
    /* Prefix comparison: limit to the car name's length so that
       "Civic" matches the text "Civic LX ..." */
    return strncmp(text, name, strlen(name));
}

char *in_array(char *car) {
    // longer list than this... NEEDS TO BE SORTED
    const char *carlist[] =
    {
        "Avalon",
        "Azera",
        "Cayenne",
        "Civic",
        "Corolla",
        "Elantra",
        "F-150",
        "Hilux",
        "Lexus LS",
        "Rav 4",
        "Sienna",
        // etc...
    };
    const char **match;
    int i;
    int n = sizeof(carlist) / sizeof(carlist[0]);
    int len = strlen(car);
    bool inword = false;
    for (i = 0; i < len; i++)
    {
        if (inword)
        {
            inword = isalnum((unsigned char)car[i]);
        }
        else if (isalnum((unsigned char)car[i]))
        {
            inword = true;
            match = bsearch(car + i, carlist, n, sizeof(char *), compare_car);
            if (match != NULL)
            {
                /* bsearch() returns a pointer to the array slot;
                   dereference it to get the matched name */
                return strdup(*match);
            }
        }
    }
    return strdup("No match");
}
Upvotes: 0
Reputation: 3012
There are a couple of ways to improve this brute-force search...
1) Parallelism: use OpenCL/CUDA (C-like languages) and do it on a GPU, with one GPU thread per keyword. Read chunks of the 100M-line file into GPU memory and search them in parallel. With OpenCL, the same code can also run on multi-core hardware.
2) Improve the algorithm: use a hash lookup. Compute the CRC32 of each keyword, store the keywords in a hash table, and compare the hashed tokens from the log-file records against the entries in the table (a minimal sketch of this idea follows).
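For illustration, here is a minimal sketch of the hash-table idea in C. FNV-1a is used as a stand-in for CRC32 (the C standard library has no CRC32), the table size and linear probing are illustrative, and it assumes single-word keywords, so multi-word names like "Lexus LS" would need a different tokenization:

#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define TABSIZE 1024  /* power of two, comfortably above keyword count */

static const char *table[TABSIZE];

static uint32_t fnv1a(const char *s, size_t n) {
    uint32_t h = 2166136261u;
    while (n--) { h ^= (unsigned char)*s++; h *= 16777619u; }
    return h;
}

static void insert(const char *word) {
    uint32_t i = fnv1a(word, strlen(word)) & (TABSIZE - 1);
    while (table[i])                      /* linear probing; assumes   */
        i = (i + 1) & (TABSIZE - 1);      /* the table is never full   */
    table[i] = word;
}

static const char *lookup(const char *tok, size_t n) {
    uint32_t i = fnv1a(tok, n) & (TABSIZE - 1);
    while (table[i]) {
        if (strlen(table[i]) == n && memcmp(table[i], tok, n) == 0)
            return table[i];
        i = (i + 1) & (TABSIZE - 1);
    }
    return NULL;
}

/* Tokenize a line on non-alphanumerics and probe each token. */
static const char *scan_line(const char *line) {
    const char *p = line;
    while (*p) {
        while (*p && !isalnum((unsigned char)*p)) p++;
        const char *start = p;
        while (*p && isalnum((unsigned char)*p)) p++;
        if (p > start) {
            const char *hit = lookup(start, (size_t)(p - start));
            if (hit) return hit;
        }
    }
    return NULL;
}

Populate the table once with insert() for each keyword, then call scan_line() on every log line; each token then costs one hash plus a probe instead of a comparison against the whole list.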
Upvotes: 1
Reputation: 283614
A DFA (Deterministic Finite Automaton, plural Automata) can match a string against many patterns much more quickly than sequentially testing each pattern independently.
Parser generators are good at building DFA tables automatically.
The downside is that the DFA has to be rebuilt when the list of patterns changes... but since your list of patterns is currently hardcoded as an array, that shouldn't be problematic for you.
Since you are using C, flex would be a good tool to use.
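For example (the rule set and actions here are invented for illustration), a flex specification that prints every listed name found in the input could look like:

%{
/* Minimal flex sketch: print every listed car name in the input.
 * Extend the alternation with the full word list. */
#include <stdio.h>
%}
%option noyywrap
%%
"Avalon"|"Civic"|"Corolla"|"Lexus LS"    { printf("%s\n", yytext); }
.|\n                                     { /* discard everything else */ }
%%
int main(void) {
    yylex();
    return 0;
}

Built with something like flex cars.l && cc lex.yy.c -o cars (file name assumed). flex compiles the alternation into a single DFA, so each input character is examined once regardless of how many names are in the list.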
Upvotes: 5