Kevin Tan

Reputation: 61

Clock Cycles for the invlpg instruction

I was reading some documentation about the invlpg instruction for Intel Pentium processors and it says that it takes 25 clock cycles. I thought that this depended on the implementation (the particular CPU) and not the actual instruction set architecture? Or is the fact that this instruction must take 25 clock cycles to run also part of the instruction set specification?

Upvotes: 0

Views: 215

Answers (2)

Peter Cordes

Reputation: 364220

That number is not part of any official ISA documentation; it's just performance data that someone annotated into an old (then-current) copy of Intel's ISA docs.

It's from one specific microarchitecture, presumably P5 Pentium, which was relevant back when Tripod was a widely used web host and which that guide says it documents. (These days there are Pentium/Celeron CPUs that are just cut-down versions of i3/i5/i7 of the same generation, with stuff like AVX and BMI1/2 disabled. But Pentium used to refer to the P5 microarchitecture.)

It's not from Intel's documentation; it was added by whoever compiled that HTML. The formatting is similar to modern versions of Intel's vol.2 x86 SDM instruction-set reference manual. You can find HTML extracts of that at https://github.com/HJLebbink/asm-dude/wiki/INVLPG and https://www.felixcloutier.com/x86/invlpg for example. The encoding / mnemonic / description table at the top has identical formatting in your Tripod link, but the actual text for invlpg is somewhat different. The text for inc, on the other hand, is word for word identical between current Intel and Tripod.

So yes, this is based on an old PDF-to-HTML conversion of Intel's vol.2 manual, with P5 cycles and instruction-pairing info added (inc pairs in the U or V pipe on that dual-issue in-order pipeline, which doesn't break instructions down into uops). The FLAGS-updating section has also been turned into tables.

That instruction-pairing and cycle-count info is totally irrelevant when tuning for modern microarchitectures like Skylake and Zen, but you can find it in Agner Fog's instruction tables: his spreadsheet has a sheet for P5, as well as for later Intel, AMD, and Via microarchitectures. (Also see his optimization guide and microarch pdf for background info to help you make sense of uops / ports / latency / throughput info.) Agner doesn't test most kernel instructions so invlpg isn't in his list.

http://faydoc.tripod.com/cpu/index.htm is obviously not an official Intel source. IDK where the author of this got their info from. Maybe they tested it themselves, or used test results that other people have published. Intel has also sometimes published timing numbers for some microarchitectures, e.g. as part of their optimization manual. That is totally separate from the x86 ISA manuals, and is not something you can rely on for correctness.


Another good source for experimental test results of instruction performance (uops for which ports, latency, and throughput) is https://uops.info/. Their testing for invlpg m8 shows it has a back-to-back throughput of ~194 cycles in practice on Skylake-client, ~157 on Nehalem, and ~126.25 on Zen+ and Zen2, to pick some random examples. But it may interleave better with other instructions: it takes "only" 47 front-end uops on recent Intel CPUs, so it can issue in under 12 cycles if the back-end has room in the ROB / RS, maybe letting later instructions execute while the invlpg operation is in progress. (Although if it takes over 100 cycles for its uops to retire, that will often stall OoO exec at some point for a fraction of the total time.)
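The "under 12 cycles" figure is just the uop count divided by the issue width. A minimal sketch of that arithmetic, assuming a 4-uop-per-cycle issue width (my assumption for the Skylake-family front end; the function name is made up for illustration):

```python
def min_issue_cycles(uops: int, issue_width: int) -> float:
    """Lower bound on cycles for the front end to issue `uops` uops.

    This ignores back-end stalls entirely; it is only the front-end bound.
    """
    return uops / issue_width


# 47 front-end uops is the uops.info figure quoted above;
# the width of 4 is an assumption, not something from that data.
front_end_bound = min_issue_cycles(47, 4)
print(front_end_bound)  # 11.75
```

This is only a lower bound on issue time. The ~194-cycle back-to-back throughput measured on Skylake shows that something other than the front end is the real limit when invlpg instructions run one after another.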

Remember that instruction performance can't be characterized by a single number on out-of-order CPUs; it's not one dimensional. Perf analysis is not as simple as adding up cycle costs for all instructions in a loop; you have to analyze how they can overlap with each other. Or for complex cases like invlpg, measure.
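As a toy illustration of why adding up per-instruction costs goes wrong, here is a sketch that compares the naive sum of latencies with a simple bottleneck estimate for one loop iteration. All the instruction numbers and the 4-wide issue width are made up for illustration, not measurements from any real CPU, and the model ignores execution ports, ROB size, and memory effects that a real analysis would need.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    uops: int           # front-end uops (hypothetical values)
    latency: int        # cycles from inputs ready to result ready (hypothetical)
    on_dep_chain: bool  # part of the loop-carried dependency chain?


def naive_cycles(body):
    """The in-order style estimate: add up every instruction's latency."""
    return sum(i.latency for i in body)


def bottleneck_cycles(body, issue_width=4):
    """Crude out-of-order estimate: the slower of front-end issue
    and the loop-carried dependency chain sets cycles per iteration."""
    front_end = sum(i.uops for i in body) / issue_width
    dep_chain = sum(i.latency for i in body if i.on_dep_chain)
    return max(front_end, dep_chain)


# A hypothetical sum-of-products loop body.
body = [
    Instr("load", uops=1, latency=5, on_dep_chain=False),
    Instr("multiply", uops=1, latency=3, on_dep_chain=False),
    Instr("accumulate", uops=1, latency=1, on_dep_chain=True),
    Instr("loop branch", uops=1, latency=1, on_dep_chain=False),
]

print(naive_cycles(body), bottleneck_cycles(body))
```

In this made-up case the naive sum is 10 cycles per iteration, while the bottleneck estimate is 1, because the load and multiply from different iterations overlap and only the accumulate is loop-carried.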

Upvotes: 3

David Schwartz

Reputation: 182759

The documentation is saying that it took 25 clock cycles on the Pentium. The number of clock cycles the instruction takes on other CPUs may be more or fewer. The performance of instructions is not part of the instruction set specification.

Upvotes: 3
