Reputation: 31
As the title advances, we would like to get some advice on the fastest algorithm available for pattern matching with the following constrains:
Long dictionary: 256
Short but not fixed length rules (from 1 to 3 or 4 bytes depth at most)
Small (150) number of rules (if 3 bytes) or moderate (~1K) if 4
Better performance than current AC-DFA used in Snort or than AC-DFA-Split again used by Snort
Software based (recent COTS systems like E3 of E5) Ideally would like to employ some SIMD / SSE stuff due to the fact that currently they are 128 bit wide and in near future they will be 256 in opposition to CPU's 64
We started this project by prefiltering Snort AC with algorithm shown on Sigmatch paper but sadly the results have not been that impressive (~12% improvement when compiling with GCC but none with ICC)
Afterwards we tried to exploit new pattern matching capabilities present in SSE 4.2 through IPP libraries but no performance gain at all (guess doing it directly in machine code would be better but for sure more complex)
So back to the original idea. Right now we are working along the lines of Head Body Segmentation AC but are aware unless we replace the proposed AC-DFA for the head side will be very hard to get improved performance, but at least would be able to support much more rules without a significant performance drop
We are aware using bit parallelism ideas use a lot of memory for long patterns but precisely the problem scope has been reduce to 3 or 4 bytes long at most thus making them a feasible alternative
We have found Nedtries in particular but would like to know what do you guys think or if there are better alternatives
Ideally the source code would be in C and under an open source license.
IMHO, our idea was to search for something that moved 1 byte at a time to cope with different sizes but do so very efficiently by taking advantage of most parallelism possible by using SIMD / SSE and also trying to be the less branchy as possible
I don't know if doing this in a bit wise manner or byte wise
Back to a proper keyboard :D
In essence, most algorithms are not correctly exploiting current hardware capabilities nor limitations. They are very cache inneficient, very branchy not to say they dont exploit capabilities now present in COTS CPUs that allow you to have certain level of paralelism (SIMD, SSE, ...)
This is preciselly what we are seeking for, an algorithm (or an implementation of an already existing algorithm) that properly considers all that, with the advantag of not trying to cover all rule lengths, just short ones
For example, I have seen some papers on NFAs claming that this days their performance could be on pair to DFAs with much less memory requirements due to proper cache efficiency, enhanced paralelism, etc
Upvotes: 3
Views: 837
Reputation: 33
Please take a look at: http://www.slideshare.net/bouma2 Support of 1 and 2 bytes is similar to what Baxter wrote above. Nevertheless, it would help if you could provide the number of single-byte and double-byte strings you expect to be in the DB, and the kind of traffic you are expecting to process (Internet, corporate etc.) - after all, too many single-byte strings may end up in a match for every byte. The idea of Bouma2 is to allow the incorporation of occurrence statistics into the preprocessing stage, thereby reducing the false-positives rate.
Upvotes: 2
Reputation: 95392
It sounds like you are already using hi-performance pattern matching. Unless you have some clever new algorithm, or can point to some statistical bias in the data or your rules, its going to be hard to speed up the raw algorithms.
You might consider treating pairs of characters as pattern match elements. This will make the branching factor of the state machine huge but you presumably don't care about RAM. This might buy you a factor of two.
When running out of steam algorithmically, people often resort to careful hand coding in assembler including clever use of the SSE instructions. A trick that might be helpful to handle unique sequences whereever found is to do a series of comparisons against the elements and forming a boolean result by anding/oring rather than conditional branching, because branches are expensive. The SSE instructions might be helpful here, although their alignment requirements might force you to replicate them 4 or 8 times.
If the strings you are searching are long, you might distribute subsets of rules to seperate CPUs (threads). Partitioning the rules might be tricky.
Upvotes: 1