Reputation: 33
I'm trying to solve a two-dimensional Ising model with a Monte Carlo approach.
As it is slow, I used Cython to accelerate the code execution. I would like to push it even further and parallelize the Cython code. My idea is to split the two-dimensional lattice in two, so for any point on a lattice has its nearest neighbours on the other lattice. This way, I can randomly choose one lattice, and I can flip all the spins and this could be done in parallel since all those spins are independent.
So far this is my code:
(inspired from http://jakevdp.github.io/blog/2017/12/11/live-coding-cython-ising-model/)
%load_ext Cython
%%cython
cimport cython
cimport numpy as np
import numpy as np
from cython.parallel cimport prange
@cython.boundscheck(False)
@cython.wraparound(False)
def cy_ising_step(np.int64_t[:, :] field,float beta):
cdef int N = field.shape[0]
cdef int M = field.shape[1]
cdef int offset = np.random.randint(0,2)
cdef np.int64_t[:,] n_update = np.arange(offset,N,2,dtype=np.int64)
cdef int m,n,i,j
for m in prange(M,nogil=True):
i = m % 2
for j in range(n_update.shape[0]) :
n = n_update[j]
cy_spin_flip(field,(n+i) %N,m%M,beta)
return np.array(field,dtype=np.int64)
cdef cy_spin_flip(np.int64_t[:, :] field,int n,int m, float beta=0.4,float J=1.0):
cdef int N = field.shape[0]
cdef int M = field.shape[1]
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
if dE <= 0 :
field[n,m] *= -1
elif np.exp(-dE * beta) > np.random.rand():
field[n,m] *= -1
I tried using a prange
-constructor, but I'm having a lots of troubles with GIL-lock. I’m new to Cython and parallel computing so I could easily have missed something.
The error:
Discarding owned Python object not allowed without gil
Calling gil-requiring function not allowed without gil
Upvotes: 3
Views: 4305
Reputation: 1
Question: "How to use
prange
in cython?" . . . . + (an Epilogue on True-[PARALLEL]
True-randomness ...)
Short version: best in those and only those places, where performance gains.
Longer version:
Your problem starts not with avoiding a GIL-lock ownership, but with the physics and the Performance losses from almost computational antipatterns, irrespective of all the powers the cython
-isation may have ever enabled.
The code as-is attempts to apply a two-dimensional kernel
operator over a whole two-dimensional domain of the {-1|+1}
-spin-field[N,M]
, best in some fast and smart manner.
The actual result is incongruent with physical field Ising, because a technique of "destructive"-self-rewriting the actual-state of the field[n_,m]
right "during" a current generation of [PAR][SEQ]
-organised coverage of the two-dimensional domain of the field[:,:]
of current spin values sequentially modifies the state of the field[i,j]
, which obviously does not happen in the real-world of the recognised laws of physics. Computers are ignorant of these rules, we, humans, should prefer not to.
Next, the prange
'd attempt calls ( M * N / 2 )
-times a cdef
-ed cy_spin_flip()
in a way, that might've been easy to code, yet which is immensely inefficient, if not a performance anti-pattern testing canard to ever run this way.
If one benchmarks the costs of invoking about 1E6
-calls to a repaired, so as to become congruent with the laws of physics, cy_spin_flip()
function, one straight sees the costs of per-call overheads start matter, the more when passing them in a prange
-d fashion ( isolated, un-coordinated, memory-layout agnostic, almost atomic memory-I/O will devastate any cache / cache-line coherence ). This is an additional cost for going into naive prange
, instead of attempts to do some vectorised / block-optimised, memory-I/O smarter matrix / kernel processing.
A fast sketched, vectorised code, using a trick proposed by a Master of Vectorisation @Divakar, can produce one step per ~ 3k3 [µs]
without CPU-architecture tuning and further tweaking on spin_2Dstate[200,200]
:
The initial state is:
spin_2Dstate = np.random.randint( 2, size = N * M, dtype = np.int8 ).reshape( N, M ) * 2 - 1
# pre-allocate a memory-zone:
spin_2Dconv = spin_2Dstate.copy()
The actual const
convolution kernel is :
spin_2Dkernel = np.array( [ [ 0, 1, 0 ],
[ 1, 0, 1 ],
[ 0, 1, 0 ]
],
dtype = np.int8 # [PERF] to be field-tested,
) # some architectures may get faster if matching CPU-WORD
The actual CPU-architecture may benefit from smart-aligned data types, yet for larger two-dimensional domains ~ [ > 200, > 200 ]
users will observe growing costs due to useless amount of memory-I/O spent on 8-B-rich transfers of a principally binary { -1 | +1 }
or even more compact bitmap stored-{ 0 | 1 }
spin-information.
Next, instead of double-looping calls on each field[:,:]
-cell, rather block-update the full two-dimensional domain in one step, the helpers get:
# T[:,:] * sum(?)
spin_2Dconv[:,:] = spin_2Dstate[:,:] * signal.convolve2d( spin_2Dstate,
spin_kernel,
boundary = 'wrap',
mode = 'same'
)[:,:]
Because of the physics inside the spin-kernel properties,
this helper array will consist of only { -4 | -2 | 0 | +2 | +4 }
values.
A simplified, fast vector code:
def aVectorisedSpinUpdateSTEPrandom( S = spin_2Dstate,
C = spin_2Dconv,
K = spin_2Dkernel,
minus2betaJ = -2 * beta * J
):
C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand(), -1, 1 )
For cases where the Physics does not recognise a uniform probability for spin-flip to happen across the whole two-dimensional domain at a same value, replace a scalar produced from the np.random.rand()
with a two-dimensional-field-of-(individualised †)-probabilities delivered from np.random.rand( N, M )[:,:]
and this will now add some costs up to some 7k3 ~ 9k3 [us]
per a spin update step :
def aVectorisedSpinUpdateSTEPrand2D( S = spin_2Dstate,
C = spin_2Dconv,
K = spin_2Dkernel,
minus2betaJ = -2 * beta * J
):
C[:,:] = S[:,:] * signal.convolve2d( S, K, boundary = 'wrap', mode = 'same' )[:,:]
S[:,:] = S[:,:] * np.where( np.exp( C[:,:] * minus2betaJ ) > np.random.rand( N, M ), -1, 1 )
>>> aClk.start(); aVectorisedSpinUpdateSTEPrand2D( spin_2Dstate, spin_2Dconv, spin_2Dkernel, -0.8 );aClk.stop()
7280 [us]
8984 [us]
9299 [us]
wide-screen commented as-was source:
// ###################################################################### Cython PARALLEL prange / GIL-lock issues related to randomness-generator state-space management if PRNG-s are "immersed"-inside the cpython realms
# https://www.desmos.com/calculator/bgz9t3s3nm
@cython.boundscheck( False ) # https://www.desmos.com/calculator/ttz3r735qy
@cython.wraparound( False ) # https://stackoverflow.com/questions/62249186/how-to-use-prange-in-cython
def cy_ising_step( np.int64_t[:, :] field, # field[N,M] of INTs (spin) { +1 | -1 } so why int64_t [SPACE] 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
float beta # beta: a float-factor
): #
cdef int N = field.shape[0] # const
cdef int M = field.shape[1] # const
cdef int offset = np.random.randint( 0, 2 ) #_GIL-lock # const ??? NEVER RE-USED BUT IN THE NEXT const SETUP .... in pre-load const-s from external scope ??? an inital RANDOM-flip-MODE-choice-{0|1}
cdef np.int64_t[:,] n_update = np.arange( offset, N, 2, dtype = np.int64 ) # const ??? 8-B far small int-s ?? ~ field[N,M] .......... being { either | or } == [ {0|1}, {2|3}, ... , { N-2 | N-1 } ] of { (S) | [L] }
cdef int m, n, i, j # idxs{ (E) | [O] }
# #
for m in prange( M, nogil = True ): # [PAR]||||||||||||||||||||||||||||| m in M |||||||||
i = m % 2 # ||||||||||||||||||||||||| i = m % 2 ||||||||| ... { EVEN | ODD }-nodes
for j in range( n_update.shape[0] ) : # [SEQ] j over ... ||||||||| ... over const ( N / 2 )-steps ~ [0,1,2,...,N/2-1] as idx2access n_update with {(S)|[L]}-indices
# n = n_update[j] # n = n_update[j] |||||||||
# cy_spin_flip( field, ( n + i ) % N, m % M, beta ) # |||||||||
# ||||| # INCONGRUENT with PHYSICAL FIELD ISING |||||||||
# vvvvv # self-rewriting field[n_,m]"during" current generation of [PAR][SEQ]-organised coverage of 2D-field[:,:]
pass; cy_spin_flip( field, ( n_update[j] + i ) % N, m % M, beta ) # modifies field[i,j] ??? WHY MODULO-FUSED ( _n + {0|1} ) % N, _m % M ops when ALL ( _n + {0|1} ) & _m ARE ALWAYS < N, M ???? i.e. remain self ?
# # |||||||||
return np.array( field, dtype = np.int64 ) # ||||||||| RET?
#||| cy_spin_flip( ) [PAR]|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| [PERF]: all complete call-overheads are paid M*N/2 times (just to do a case-switching)
cdef cy_spin_flip( np.int64_t[:, :] field, # field[N,M] of ints (spin) { +1 | -1 } why int64_t 8-Bytes for a principal binary ? Or a complex128 for Quantum-state A*|1> + B*|0> ?
int n, # const int
int m, # const int
float beta = 0.4, # const float ? is a pure positive scalar or can also be negative ?
float J = 1.0 # const float ? is a pure positive scalar or can also be negative ? caller keeps this on an implicit, const == 1 value
):
cdef int N = field.shape[0] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef int M = field.shape[1] # const int ? [PERF]: Why let this test & assignment ever happen to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
cdef float dE = ( 2 * J * field[ n, m ] # const float [?] [PERF]: FMUL 2, J to happen as-many-as-N*M-times - awfully expensive, once principally avoidable...
*( field[( n - 1 ) % N, m ] # | (const) vvvv------------aSureSpinFLIP
+ field[( n + 1 ) % N, m ] # [?]-T[n,m]-[?] sum(?) *T *( 2*J ) the spin-game ~{ -1 | +1 } * sum( ? ) |::::|
+ field[ n, ( m - 1 ) % M] # | := {-8J |-4J | 0 | 4J | 8J }
+ field[ n, ( m + 1 ) % M] # [?] a T-dependent choice|__if_+T__| |__if_-T__| FLIP @random-scaled by 2*J*beta
)# | | # ( % MODULO-fused OPs "skew" physics - as it "rolls-over" a two-dimensional field TOPOLOGY )
) # | | #
if dE <= 0 : # | | #
field[ n, m ] *= -1 # [PERF]: "inverts" spin (EXPENSIVE FMUL instead of bitwise +1 or NumPy-efficient block-wise XOR MASK) (two-dimensional requires more efforts for best cache efficiency)
elif ( np.exp( -dE * beta ) # | | # [PERF]: with a minusBETA, one MUL uop SAVED * M * N
> np.random.rand() #__________|_____________|__________GIL-lock# [PERF]: pre-calc in the external-scope + [PHYSICS]: Does the "hidden"-SEQ-order here anyhow matter in realms of generally accepted laws of PHYSICS???
): # | | # Is a warranty of the uniform distribution "lost" by an if(field-STATE)-governed sub-stepping ????
field[ n, m ] *= -1 # identical OP ? .OR.-ed in if(): ? of a pre-generated uniform-.rand() or a general (non-sub-stepped) sequenced stepping ????
# # in a stream-of-PRNG'd SPIN-FLIP threshold floats from a warranted uniform distrib. of values ????
The beta
-controlled (given const J
) model of spin-flip thresholds for { -8 | -4 | 0 | +4 | +8 }
which are the only cases for ~ 2 * spin_2Dkernel
-convolutions across the whole two-dimensional domain of the current spin_2Dstate
, is available here: https://www.desmos.com/calculator/bgz9t3s3nm one may live-experiment with beta
to see the lowering threshold for either of possible positive outputs { + 4 | + 8 }
, as np.exp( -dE * 2 * J * beta )
is strongly controlled by beta
and the larger the beta
the lower the probability a randomly drawn number, warranted to be from a semi-closed range [0, 1)
will not dominate the np.exp()
-result.
"Normally on a true Metropolis algorithm, you flip spins (chosen randomly) one by one. As I wanted to parallelize the algorithm I flip half the spins for each iteration (when the function cy_ising_step is called). Those spins are chosen in a way that none of them are nearest neighbor as it would impact the Monte-Carlo optimization. This might not be a correct approach..."
– Angelo C 7 hours ago
Thanks for all remarks & details on method and your choices. The "most-(densely)-aggressive" spin updates by a pair of non-"intervening" lattices requires the more careful choice of strategy for sourcing the randomness.
While using the "most-aggressive" density of somehow-probable updates, the source of randomness is the core trouble - not only for the overall processing performance (a technical issue on its own how to maintain a FSA-state, if resorted to a naive, central PRNG-source).
You either design your process to be truly a randomness based (using some of the available sources of indeed nondeterministic entropy), or willing to be subordinated to a policy to allowing repeatable experiments (for re-inspection and revalidation of scientific computing), for which you have one more duty—a duty of configuration management of such scientific experiment (to record / setup / distribute / manage the initial "seeding" of all PRNG-s, that the scientific computing experiment is configured to use.
Here, given the nature warrants the spins to be mutually independent in the two-dimensional domain of the field[:,:]
, the direction of the time-arrow ought be the only direction, in which such (deterministic)-PRNG-s may retain their warranty of outputs remaining uniformly distributed over [0,1)
. As a side-effect of that, they will cause no problems for a parallelisation of their individual evolution of their respective internal states. Bingo! Computationally cheap, HPC-grade performant & robustly-random PRNG-s are a safe way for doing this (be warned, if not aware of already, not all "COTS" PRNG-s have all these properties "built-in").
That means, either of the spins will remain fair and congruent with the laws of physics if and only if it sources a spin-flip decision threshold from its "own" (thus congruently autonomous to retain the uniformity of distribution of outputs) PRNG-instance (not a problem, but a care is needed not to forget to implement it right and run it efficiently).
For a case of a need to operate an indeed non-deterministic PRNG, the source of a truly nondeterministic entropy may become a performance bottleneck, if trying to use it beyond its performance ceiling limit. A fight for a nature-like entropy is a challenging task in a domain of (no matter how large, yet still) finite-state automatons, isn't it?
Upvotes: 4
Reputation: 30912
From a Cython point-of-view the main problem is that cy_spin_flip
requires the GIL. You need to add nogil
to the end of its signature, and set the return type to void
(since by default it returns a Python object, which requires the GIL).
However, np.exp
and np.random.rand
also require the GIL, because they're Python function calls. np.exp
is probably easily replaced with libc.math.exp
. np.random
is a bit harder, but there's plenty of suggestions for C- and C++-based approaches: 1 2 3 4 (+ others).
A more fundamental problem is the line:
cdef float dE = 2*J*field[n,m]*(field[(n-1)%N,m]+field[(n+1)%N,m]+field[n,(m-1)%M]+field[n,(m+1)%M])
You've parallelized this with respect to m
(i.e. different values of m
are run in different threads), and each iteration changes field
. However in this line you are looking up several different values of m
. This means the whole thing is a race-condition (the result depends on which order the different threads finish) and suggests your algorithm may be fundamentally unsuitable for parallelization. Or that you should copy field
and have field_in
and field_out
. It isn't obvious to me, but this is something that you should be able to work out.
Edit: it does look like you've given the race condition some thought with using i%2
. It isn't obvious to me that this is right though. I think a working implementation of your "alternate cells" scheme would look something like:
for oddeven in range(2):
for m in prange(M):
for n in range(N):
# some mechanism to pick the alternate cells here.
i.e. you need a regular loop to pick the alternate cells outside your parallel loop.
Upvotes: 1