Reputation: 3692
This depends on the length of the string, so let's take 3 cases: the easiest case, the worst case and a middle one, all for a 32-bit unsigned integer. The values will be 0, 4294967295 and 67295 (a half-length string).
Let's say it runs on a modern CPU, like an i7 Nehalem.
I want to show, with concrete numbers, how CPU-intensive this operation is. The algorithm is typically a small loop where one iteration needs the result of the previous one, so the code will not take advantage of superscalar CPU optimizations.
Is there any hard-wired instruction to perform this operation on a modern CPU?
EDIT: I tried to answer this myself and did some research.
Using Agner Fog's 'Instruction Tables' and code from this answer:
; parameters: esi = pointer to the string, ecx = length of the string
string_to_int:
    xor ebx, ebx            ; clear ebx                     > 1,1
.next_digit:
    movzx eax, byte [esi]   ;                               > 1,1
    inc esi                 ;                               > 1,1
    sub al, '0'             ; convert from ASCII to number  > 1,1
    imul ebx, 10            ;                               > 1,3
    add ebx, eax            ; ebx = ebx*10 + eax            > 1,1
    loop .next_digit        ; while (--ecx)                 > 6
    mov eax, ebx            ;                               > 1,1
    ret
The first and last instructions run once. The sum of the other latencies and execution costs is 18 per iteration, so the answer to the question should be 4 + 18*string.length.
It is far less than I thought. And this is just the conversion; it does not count all the operations needed to handle a string: copy from NIC buffer to RAM, RAM to CPU cache ...
Am I counting the right things? Is there any micro-optimization expert who can tell me if this looks right?
Upvotes: 0
Views: 1316
Reputation: 365537
About 7 to 10 cycles for a 5-digit integer is about right on a typical modern x86 with good scalar code, assuming the loop branch predicts the exit correctly and the ASCII bytes are already hot in L1d cache. SIMD (like SSSE3) can go faster, and can be worth it if you have a lot of integers to parse, especially if they're often not very short.
That code is not efficient. My more recent answer on the same question (NASM Assembly convert input to integer?) avoids the slow loop instruction, and avoids imul by using 2 lea instructions (more like what a compiler would generate). Also it terminates on a non-digit instead of requiring a parser to have already found the length.
But more importantly, you're not accurately counting throughput or latency cost for this (those are two separate things on a CPU with out-of-order exec like Nehalem!) See How many CPU cycles are needed for each assembly instruction? and What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
Hundreds of cycles to parse a short integer that's already hot in L1d cache? What kind of chip you got in there, a Dorito? (or a 386?) Modern Microprocessors: A 90-Minute Guide! is a very good intro to CPUs that will help you understand what to do with numbers from Agner Fog's tables.
You shouldn't add up latencies for instructions that aren't part of the same loop-carried dependency chain. Out-of-order superscalar CPUs run independent instructions in parallel. Even a P5 Pentium could overlap some of that work, although imul is slow on such old CPUs. (And movzx is not pairable on Pentium. It's fine on anything from this century, though.)
Also, you have the wrong throughput numbers. Independent inc and sub instructions can run at 3 per clock on Nehalem, reciprocal throughput = 0.33 from Agner's table. Maybe you were counting uops instead of throughput? You can only add throughput costs after checking whether they compete for the same execution unit or not, so just a throughput number isn't very useful. (latency vs throughput in intel intrinsics)
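To make that concrete, here is the loop from the question again with the loop-carried dependency chain marked. This is only a rough sketch of how to read Agner Fog's numbers, not a full analysis:
.next_digit:
    movzx eax, byte [esi]   ; not part of the EBX chain: depends only on ESI and memory
    inc esi                 ; ESI chain, 1 cycle per iteration, runs far ahead
    sub al, '0'             ; feeds this iteration's digit into the add below
    imul ebx, 10            ; loop-carried through EBX: 3 cycle latency
    add ebx, eax            ; loop-carried through EBX: 1 cycle latency
    loop .next_digit        ; ECX chain, independent of the EBX chain
The only per-iteration critical path is imul + add through EBX, roughly 4 cycles; the byte load, pointer increment and ASCII adjustment overlap with it instead of adding to it.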
https://uica.uops.info predicts that Sandybridge will run just the loop part (after replacing loop with dec ecx / jnz) at 4 cycles per iteration, just bottlenecked on the latency of the dependency chain for EBX through imul and add.
But OoO exec can overlap that with surrounding code for short numbers (although if the loop branch mispredicts on the last iteration, when it falls through instead of jumping, execution will probably be done with the old work by the time it starts executing later instructions, except for very long numbers). Anyway, the throughput cost is just 2 cycles per iteration. Nehalem should be similar modulo front-end bottlenecks (but it does have a loop buffer like SnB), although it would have a partial-register merging stall when reading EAX after sub al, '0'. But that's trivially fixable with sub eax, '0' like you'd find in real compiler-generated code.
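For reference, the loop with those two changes (dec/jnz instead of loop, and a full-register sub) looks something like this; it's a sketch, not the exact code fed to uiCA:
.next_digit:
    movzx eax, byte [esi]   ; off the critical path
    inc esi
    sub eax, '0'            ; full-register sub, no partial-register merging stall
    imul ebx, 10            ; EBX chain: 3c latency
    add ebx, eax            ; EBX chain: 1c -> latency-bound at about 4 cycles/iteration
    dec ecx                 ; replaces the slow loop instruction
    jnz .next_digit         ; dec/jnz macro-fuses on Sandybridge-family, not Nehalem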
My optimized version (other answer on the same linked question) has a loop with the same number of front-end uops (6), but uses 2x lea instead of imul/add, so the latency bottleneck is down to 2 cycles. And the loop condition is cmp/jbe to check for non-digit characters. cmp/jbe can macro-fuse into a single compare-and-branch uop on Nehalem, and modern AMD, unlike dec ecx/jnz which can only macro-fuse on Intel Sandybridge-family.
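A loop in that style looks roughly like this (a hypothetical string_to_uint sketch of the idea, assuming the string ends with some non-digit byte such as a terminator, not copied verbatim from that answer):
; esi = pointer to the string, result returned in eax
string_to_uint:
    xor ecx, ecx                ; running total = 0
    movzx eax, byte [esi]       ; peel the first load
    inc esi
    sub eax, '0'
    cmp eax, 9
    ja .done                    ; empty / non-digit input -> 0
.next_digit:                    ; 6 front-end uops with cmp/jbe macro-fused
    lea ecx, [ecx + ecx*4]      ; total *= 5                        (1c latency)
    lea ecx, [eax + ecx*2]      ; total = total*2 + digit = old*10 + digit (1c latency)
    movzx eax, byte [esi]       ; load the next character, off the critical path
    inc esi
    sub eax, '0'
    cmp eax, 9
    jbe .next_digit             ; keep looping while it's a decimal digit 0-9
.done:
    mov eax, ecx
    ret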
See also How to implement atoi using SIMD? and projects like SIMDJSON; Dan Lemire wrote about it: https://lemire.me/blog/2022/05/25/parsing-json-faster-with-intel-avx-512/. (SIMDJSON uses SIMD for much more than just string to integer, also for finding starts/ends of blocks and validating UTF-8 and stuff like that.)
Upvotes: 1
Reputation: 3274
Interpreting the contents of an XML file consumes copious amounts of CPU cycles. A large one will require CPU seconds or longer to parse. If you keep the large XML file as your database of values, just finding the data will, on average, take millions of times longer than converting the data to this or that numeric format.
Ergo, (if possible) do a one-time conversion of the file to a vector, matrix or other structure of binary values you can easily index into. If indexing is not possible, just finding the data in the vector will still be a lot faster than doing the same in the XML file. In any case, even in the one-time conversion, the job of finding the data in the XML file will be much greater than converting it once you've found it.
As to your question itself, I would have guessed 10 (closer to 15) cycles per loop iteration. I have read that the loop[cond] instruction is a carryover from earlier processor generations that may have stopped being useful around the time the Pentium was introduced. Assembly output from gcc confirms that it is seldom used. As I understand it, that has to do with the instruction not being easily reordered, and with it testing status flags that might not be available yet when it starts executing, leading to processor stalls. Its result (jump taken) should be fairly predictable, though.
Upvotes: 2
Reputation: 3917
The algorithm you have to convert a string into an integer value is the generally accepted algorithm (32-bit). However, there is more than one algorithm to convert an integer value to a string (not to mention different instruction sets, microarchitectures, cache sizes, etc.). Even if you pin down every one of those things, there isn't any one single answer to that question.
That said, I think this may be a case of premature optimization. If I understand correctly, you're concerned with the additional overhead introduced by a non-binary protocol. Binary protocols are generally an extreme measure to take to increase performance, and they're generally adopted to limit latency, not to increase throughput.
There are a lot of benefits to text protocols that you're forced to give up when using binary protocols (compression characteristics, ease of use, ease of debugging, etc.). You also need to consider that not all CPU architectures are little-endian (network byte order is specifically big-endian). Make sure the protocol is the bottleneck before you optimize.
Upvotes: 2