Reputation: 4752
I am parsing files around 1MB in size, reading the first 300KB and searching for a number of particular signatures. My strategy is, for each byte, to see whether the byte is in a map/vector/whatever of bytes that I know might be at the start of a signature, and if so to look for the full signature; for this example, assume those leading bytes are 0x37, 0x50, and 0x52. Processing a total of 90 files (9 files 10 times actually), the following code executes in 2.122 seconds:
byte * bp = &buffer[1];
const byte * endp = buffer + bytesRead - 30; // a little buffer for optimization - no signature is that long
//multimap<byte, vector<FileSignature> >::iterator lb, ub;
map<byte, vector<FileSignature> >::iterator findItr;
vector<FileSignature>::iterator intItr;

while (++bp != endp)
{
    if (*bp == 0x50 || *bp == 0x52 || *bp == 0x37) // Comparison line
    {
        findItr = mapSigs.find(*bp);
        // compare the rest of each candidate signature against the bytes that follow
        for (intItr = findItr->second.begin(); intItr != findItr->second.end(); ++intItr)
        {
            bool bMatch = true;
            for (UINT i = 1; i < intItr->mSignature.size(); ++i)
            {
                if (intItr->mSignature[i] != bp[i])
                {
                    bMatch = false;
                    break;
                }
            }
            if (bMatch)
            {
                CloseHandle(fileHandle);
                return true;
            }
        }
    }
}
However, my initial implementation finishes in a sluggish 84 seconds. The only difference is related to the line labeled "// Comparison line" above:
findItr = mapSigs.find(*bp);
if (findItr != mapSigs.end())
...
A very similar implementation using a vector containing the 3 values also results in extremely slow processing (190 seconds):
if (find(vecFirstChars.begin(), vecFirstChars.end(), *bp) != vecFirstChars.end())
{
    findItr = mapSigs.find(*bp);
    ...
But an implementation accessing the elements of the vector directly performs rather well (8.1 seconds). Not as good as the static comparisons, but still far far better than the other options:
if (vecFirstChars[0] == *bp || vecFirstChars[1] == *bp || vecFirstChars[2] == *bp)
{
    findItr = mapSigs.find(*bp);
    ...
The fastest implementation so far (inspired by Component 10 below) is the following, clocking in at about 2.0 seconds:
bool validSigs[256] = {0};
validSigs[0x37] = true;
validSigs[0x50] = true;
validSigs[0x52] = true;
while (++bp != endp)
{
    if (validSigs[*bp])
    {
        ...
Extending this to use a second validSigs-style table, to check whether the 2nd char is valid as well, brings the total run time down to 0.4 seconds (a rough sketch of that follows).
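A minimal sketch of that two-table idea, reconstructed from the description above (the validSecond name, the loop that fills it from mapSigs, and the reuse of the iterators are my assumptions, not the exact code). The second table admits a position only when the next byte can also appear second in some signature; any false positives (a valid first byte paired with a second byte from a different signature) are resolved by the full comparison that follows.
bool validSigs[256]   = {0};
bool validSecond[256] = {0};
validSigs[0x37] = validSigs[0x50] = validSigs[0x52] = true;

// fill validSecond from the second byte of every known signature
for (findItr = mapSigs.begin(); findItr != mapSigs.end(); ++findItr)
    for (intItr = findItr->second.begin(); intItr != findItr->second.end(); ++intItr)
        if (intItr->mSignature.size() > 1)
            validSecond[intItr->mSignature[1]] = true;

while (++bp != endp)
{
    if (validSigs[bp[0]] && validSecond[bp[1]])
    {
        // only now fall through to the byte-by-byte signature comparison
        ...
    }
}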
I feel the other implementations should perform better. Especially the map, which should scale as more signature prefixes are added, and whose searches are O(log(n)) vs O(n). What am I missing? My only shot-in-the-dark guess is that with the static comparisons and (to a lesser extent) the vector indexing, the values used for the comparison are cached in a register or other location that makes them significantly faster to read than memory. If this is true, can I explicitly tell the compiler that particular values are going to be used often? Are there any other optimizations I can take advantage of in this code that are not apparent?
I am compiling with Visual Studio 2008.
Upvotes: 3
Views: 294
Reputation: 275885
The "compare against constants" compares 3 memory addresses against 3 constants. This case is going to be extremely easy to do things like unroll or do bit optimization on, if the compiler feels like it. The only branches that the written ASM is going to have here are going to be highly predictable.
For the direct 3-element vector lookup, there is the additional cost of dereferencing the addresses of the vector elements.
For the vector loop (the std::find version), the compiler has no idea how big the vector is at this point, so it has to emit a generic loop. That loop contains a branch that goes one way 2 times, then the other way. If the CPU uses the heuristic "branches go the way they did last time", this results in lots of branch prediction failures.
To verify that theory, try making the branching more predictable: search for each element across up to 100 input bytes at a time before moving on to the next element. That makes naive branch prediction work on the order of 98% of the time, instead of the 33% in your code. I.e., scan 100 (or whatever) characters for signature 0, then 100 (or whatever) characters for signature 1, until you run out of signatures; then go on to the next block of 100 characters. I chose 100 because I'm trying to avoid branch prediction failures, and I figure a few percent of failures isn't all that bad. :) A rough sketch follows.
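Something like this, under my own assumptions about the surrounding code (signatures as a flat vector<FileSignature>, matchesFullSignature as a hypothetical helper that does the byte-by-byte check, and the block size are all illustrative, not the poster's actual names):
const size_t kBlock = 100;                          // tune to taste
for (const byte * blockStart = bp; blockStart < endp; blockStart += kBlock)
{
    const byte * blockEnd = (blockStart + kBlock < endp) ? blockStart + kBlock : endp;
    // one pass over the block per signature, so the comparison branch is
    // biased the same way for a long run of iterations
    for (size_t s = 0; s < signatures.size(); ++s)
    {
        const byte lead = signatures[s].mSignature[0];
        for (const byte * p = blockStart; p != blockEnd; ++p)
        {
            if (*p == lead && matchesFullSignature(signatures[s], p))  // hypothetical helper
                return true;
        }
    }
}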
As for the map solution, well, maps have a high constant overhead, so it being slow is pretty predictable. The main uses of a map are dealing with large-n lookups and the fact that they are really easy to code against.
Upvotes: 0
Reputation: 137920
This is simple enough to come down to the number of instructions executed. The vector, map, or lookup table will reside entirely in the CPU's level-1 data cache, so memory access isn't taking up time. As for the lookup table, as long as most bytes don't match a signature prefix, the branch predictor will stop flow control from taking up time. (But the other structures do incur flow-control overhead.)
So quite simply, comparing against each value in the vector in turn requires 3 comparisons. The map is O(log N), but the coefficient (which is ignored by big-O notation) is large due to navigating a linked data structure. The lookup table is O(1) with a small coefficient because access to the structure can be completed by a single machine instruction, and then all that remains is one comparison against zero.
The best way to analyze performance is with a profiler tool such as valgrind/kcachegrind.
Upvotes: 1