Reputation: 165
I have a program that has a main thread and a second thread. The second thread modifies a global variable which then will be used in the main thread. But somehow the changes I make in the second thread are not shown in the main thread.
section .bss USE32
global var
var resd 1
section .text USE32
..start:
push 0
push 0
push 0
push .second
push 0
push 0
call [CreateThread]
mov eax, 1
cmp [var], eax ; --> the content of var and '1' are not the same. Which is confusing since I set the content of var to '1' in the second thread
;the other code here is not important
.second:
mov eax, 1
mov [var], eax
ret
(This is a simplification of my real program which creates threads in a loop; I haven't tested this exact code.)
Upvotes: 1
Views: 192
Reputation: 365157
You don't join the new thread (wait for it to exit); there's no reason to assume that it's finished (or even fully started) when CreateThread returns to the main thread.
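If you only need the result once the thread is done, a minimal sketch of joining it could look like the following. It assumes WaitForSingleObject is imported the same way as CreateThread in your build (the import setup isn't shown in the question), and it skips error handling: CreateThread returns a NULL handle on failure, which real code should check for.

```nasm
; Sketch only: assumes [WaitForSingleObject] is imported like [CreateThread]
    push 0                      ; lpThreadId
    push 0                      ; dwCreationFlags
    push 0                      ; lpParameter
    push .second                ; lpStartAddress
    push 0                      ; dwStackSize
    push 0                      ; lpThreadAttributes
    call [CreateThread]         ; thread handle returned in eax

    push -1                     ; dwMilliseconds = INFINITE (0xFFFFFFFF)
    push eax                    ; hHandle = the thread we just created
    call [WaitForSingleObject]  ; blocks until the thread has exited

    mov  eax, 1
    cmp  [var], eax             ; the store done in .second has happened by now
```

Since your real program creates threads in a loop, you'd have to keep each handle somewhere and wait on all of them before reading the results.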
You could spin-wait until you see a non-zero value in [var], and count how many iterations that takes, if you want to benchmark thread-startup overhead + inter-core latency.
...
call [CreateThread]
mov edi, 1
cmp [var], edi
je .zero_latency ; if var already changed
rdtsc ; could put an lfence before and/or after this to serialize execution
mov ecx, eax ; save low half of EDX:EAX cycle count; should be short enough that the interval fits in 32 bits
xor esi, esi
.spin:
inc esi ; ++spin_count
pause ; optional, but avoids memory-order mis-speculation when var changes
cmp [var], edi
jne .spin
rdtsc
sub eax, ecx ; reference cycles since CreateThread returned
...
.zero_latency: ; jump here if the value already changed before the first iteration
Note that rdtsc measures in reference cycles, not core clock cycles, so the count won't match core cycles when turbo (or any other frequency scaling) is active. Only doing the low 32 bits of the 64-bit subtraction is fine if the interval is less than 2^32 reference cycles (e.g. about 1 second on a CPU with a reference frequency of 4.2 GHz, vastly longer than we'd expect here).
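If you ever did need to time an interval that might not fit in 32 bits, a sketch of the full 64-bit subtraction in 32-bit code could look like this. Using ebx for the high half is an arbitrary choice here; any register the spin loop doesn't touch would do.

```nasm
    rdtsc
    mov  ecx, eax       ; start time, low half
    mov  ebx, edx       ; start time, high half (ebx is just an assumption)
    ; ... same spin loop as above; it only uses esi and edi ...
    rdtsc
    sub  eax, ecx       ; low halves, sets CF on borrow
    sbb  edx, ebx       ; high halves minus the borrow
    ; EDX:EAX = 64-bit count of reference cycles elapsed
```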
esi is the spin count. With pause in the loop, you'll do about one check per 100 cycles on Skylake and later, or about one check per 5 cycles on earlier Intel. Without pause, it's about one check per core clock cycle.
Upvotes: 3