Lasse V. Karlsen
Lasse V. Karlsen

Reputation: 391576

How do I obtain CPU cycle count in Win32?

In Win32, is there any way to get a unique cpu cycle count or something similar that would be uniform for multiple processes/languages/systems/etc.

I'm creating some log files, but have to produce multiple logfiles because we're hosting the .NET runtime, and I'd like to avoid calling from one to the other to log. As such, I was thinking I'd just produce two files, combine them, and then sort them, to get a coherent timeline involving cross-world calls.

However, GetTickCount does not increase for every call, so that's not reliable. Is there a better number, so that I get the calls in the right order when sorting?


Edit: Thanks to @Greg that put me on the track to QueryPerformanceCounter, which did the trick.

Upvotes: 7

Views: 11241

Answers (6)

Zaaier
Zaaier

Reputation: 736

Use RDTSC.

Procedures like QueryPerformanceCounter are just calling RDTSC under the hood.

RDTSC has not been dependent on dynamic frequency changes for about 16 years. From The Intel 64 and IA-32 Architectures Software Developer’s Manual (Vol. 3B):

18.17.1 Invariant TSC

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.

RDTSCP is a related instruction that also provides the processor ID relevant to the timestamp that was read:

18.17.2 IA32_TSC_AUX Register and RDTSCP Support

Processors based on Nehalem microarchitecture provide an auxiliary TSC register, IA32_TSC_AUX that is designed to be used in conjunction with IA32_TSC. IA32_TSC_AUX provides a 32-bit field that is initialized by privileged software with a signature value (for example, a logical processor ID).

The primary usage of IA32_TSC_AUX in conjunction with IA32_TSC is to allow software to read the 64-bit time stamp in IA32_TSC and signature value in IA32_TSC_AUX with the instruction RDTSCP in an atomic operation. RDTSCP returns the 64-bit time stamp in EDX:EAX and the 32-bit TSC_AUX signature value in ECX. The atomicity of RDTSCP ensures that no context switch can occur between the reads of the TSC and TSC_AUX values.

Support for RDTSCP is indicated by CPUID.80000001H:EDX[27]. As with RDTSC instruction, non-ring 0 access is controlled by CR4.TSD (Time Stamp Disable flag).

User mode software can use RDTSCP to detect if CPU migration has occurred between successive reads of the TSC. It can also be used to adjust for per-CPU differences in TSC values in a NUMA system.

Depending on what you're measuring, it may be appropriate to use fences in conjunction with calls to RDTSC:

The RDTSCP instruction is not a serializing instruction, but it does wait until all previous instructions have executed and all previous loads are globally visible. 1 But it does not wait for previous stores to be globally visible, and subsequent instructions may begin execution before the read operation is performed. The following items may guide software seeking to order executions of RDTSCP:

• If software requires RDTSCP to be executed only after all previous stores are globally visible, it can execute MFENCE immediately before RDTSCP.

• If software requires RDTSCP to be executed prior to execution of any subsequent instruction (including any memory accesses), it can execute LFENCE immediately after RDTSCP.

The above also applies to AMD processors:

The behavior of the RDTSC instruction is implementation dependent. The TSC counts at a constant rate, but may be affected by power management events (such as frequency changes), depending on the processor implementation. If CPUID Fn8000_0007_EDX[TscInvariant] = 1, then the TSC rate is ensured to be invariant across all P-States, C-States, and stop-grant transitions (such as STPCLK Throttling); therefore, the TSC is suitable for use as a source of time. Consult the BIOS and Kernel Developer’s Guide applicable to your product for information concerning the effect of power management on the TSC.

AMD processors have had an invariant TSC since at least as far back as Bulldozer. From The BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 15h Models 30h-3Fh Processors:

8 | TscInvariant: TSC invariant. Value: 1. The TSC rate is invariant.

References:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf

https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/design-guides/25481.pdf

Upvotes: 0

prakash
prakash

Reputation: 59709

Heres an interesting article! says not to use RDTSC, but to instead use QueryPerformanceCounter.

Conclusion:

Using regular old timeGetTime() to do timing is not reliable on many Windows-based operating systems because the granularity of the system timer can be as high as 10-15 milliseconds, meaning that timeGetTime() is only accurate to 10-15 milliseconds. [Note that the high granularities occur on NT-based operation systems like Windows NT, 2000, and XP. Windows 95 and 98 tend to have much better granularity, around 1-5 ms.]

However, if you call timeBeginPeriod(1) at the beginning of your program (and timeEndPeriod(1) at the end), timeGetTime() will usually become accurate to 1-2 milliseconds, and will provide you with extremely accurate timing information.

Sleep() behaves similarly; the length of time that Sleep() actually sleeps for goes hand-in-hand with the granularity of timeGetTime(), so after calling timeBeginPeriod(1) once, Sleep(1) will actually sleep for 1-2 milliseconds,Sleep(2) for 2-3, and so on (instead of sleeping in increments as high as 10-15 ms).

For higher precision timing (sub-millisecond accuracy), you'll probably want to avoid using the assembly mnemonic RDTSC because it is hard to calibrate; instead, use QueryPerformanceFrequency and QueryPerformanceCounter, which are accurate to less than 10 microseconds (0.00001 seconds).

For simple timing, both timeGetTime and QueryPerformanceCounter work well, and QueryPerformanceCounter is obviously more accurate. However, if you need to do any kind of "timed pauses" (such as those necessary for framerate limiting), you need to be careful of sitting in a loop calling QueryPerformanceCounter, waiting for it to reach a certain value; this will eat up 100% of your processor. Instead, consider a hybrid scheme, where you call Sleep(1) (don't forget timeBeginPeriod(1) first!) whenever you need to pass more than 1 ms of time, and then only enter the QueryPerformanceCounter 100%-busy loop to finish off the last < 1/1000th of a second of the delay you need. This will give you ultra-accurate delays (accurate to 10 microseconds), with very minimal CPU usage. See the code above.

Upvotes: 13

moonshadow
moonshadow

Reputation: 89135

RDTSC output may depend on the current core's clock frequency, which for modern CPUs is neither constant nor, in a multicore machine, consistent.

Use the system time, and if dealing with feeds from multiple systems use an NTP time source. You can get reliable, consistent time readings that way; if the overhead is too much for your purposes, using the HPET to work out time elapsed since the last known reliable time reading is better than using the HPET alone.

Upvotes: 2

Seb
Seb

Reputation: 249

System.Diagnostics.Stopwatch.GetTimestamp() return the number of CPU cycle since a time origin (maybe when the computer start, but I'm not sure) and I've never seen it not increased between 2 calls.

The CPU Cycles will be specific for each computer so you can't use it to merge log file between 2 computers.

Upvotes: 2

Ken
Ken

Reputation: 2092

Use the GetTickCount and add another counter as you merge the log files. Won't give you perfect sequence between the different log files, but it will at least keep all logs from each file in the correct order.

Upvotes: 1

Greg Hewgill
Greg Hewgill

Reputation: 994221

You can use the RDTSC CPU instruction (assuming x86). This instruction gives the CPU cycle counter, but be aware that it will increase very quickly to its maximum value, and then reset to 0. As the Wikipedia article mentions, you might be better off using the QueryPerformanceCounter function.

Upvotes: 9

Related Questions