Reputation: 15
The NumPy loop works fine.
The CuPy loop works for 1 and 3 iterations, but fails with an error at 10 iterations.
How can I fix this problem?
Is this a GPU memory problem?
(source code)
import cupy as cp
import numpy as np

mc = 5000

def fcal(ff, nloop, skey):
    maa = ff.zeros((mc,mc)) + 0.0
    mbb = ff.zeros((mc,mc)) + 0.0
    for jj in range(nloop):
        maa = ff.dot(maa, mbb)
    asum = ff.sum(maa)
    print("[fcal] (%s) nloop=[%2d] asum=[%s]" % (skey, nloop, asum))

fcal(np, 1, "np")
fcal(np, 3, "np")
fcal(np, 10, "np")
fcal(cp, 1, "cp")
fcal(cp, 3, "cp")
fcal(cp, 10, "cp")
(execution result)
[fcal] (np) nloop=[ 1] asum=[0.0]
[fcal] (np) nloop=[ 3] asum=[0.0]
[fcal] (np) nloop=[10] asum=[0.0]
[fcal] (cp) nloop=[ 1] asum=[0.0]
[fcal] (cp) nloop=[ 3] asum=[0.0]
Traceback (most recent call last):
  File "C:\testdir\2cupy_test.py", line 30, in <module>
    fcal(cp, 10, "cp")
  File "C:\testdir\2cupy_test.py", line 22, in fcal
    print("[fcal] (%s) nloop=[%2d] asum=[%s]" % (skey, nloop, asum))
  File "cupy\core\core.pyx", line 1596, in cupy.core.core.ndarray.__str__
  File "cupy\core\core.pyx", line 1643, in cupy.core.core.ndarray.get
  File "cupy\cuda\memory.pyx", line 372, in cupy.cuda.memory.MemoryPointer.copy_to_host
  File "cupy\cuda\runtime.pyx", line 255, in cupy.cuda.runtime.memcpy
  File "cupy\cuda\runtime.pyx", line 135, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorLaunchFailure: unspecified launch failure
Upvotes: 1
Views: 287
Reputation: 722
There's no problem in your code: every iteration just multiplies matrices of zeros, one after another. If you can run it without error using a single iteration, then your problem is not in the code implementation.
You are probably running into a TDR error, as pointed out in the comments by Robert Crovella. TDR (Timeout Detection and Recovery) is the Windows mechanism that resets the GPU driver when the GPU does not respond to the OS within a time limit, and more iterations can delay that response.
If you want to check whether you're really hitting a TDR problem, and supposing one iteration runs without problems, try adding a sleep of a few seconds after each ff.dot operation so that the OS can receive a response from the GPU.
I stress that this is not a solution to the TDR problem, only a simple way to detect whether you're running into it.
import time
...
for jj in range(nloop):
    maa = ff.dot(maa, mbb)
    time.sleep(10)
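A variation on the same check, as a rough sketch rather than a definitive test: CuPy queues GPU work asynchronously, so instead of sleeping you can synchronize after each ff.dot and time it. If a single iteration takes close to the Windows TDR timeout (2 seconds by default), that supports the TDR explanation. The names timed_dots and check_cupy are hypothetical helpers made up for this example; cp.cuda.Stream.null.synchronize() is the CuPy call assumed to wait for the default stream.

```python
import time


def timed_dots(ff, maa, mbb, nloop, sync=None):
    """Run maa = ff.dot(maa, mbb) nloop times.

    Returns the final maa and a list with the seconds spent per iteration.
    sync is an optional callable that blocks until the device is idle.
    """
    durations = []
    for _ in range(nloop):
        start = time.perf_counter()
        maa = ff.dot(maa, mbb)
        if sync is not None:
            sync()  # wait for the queued GPU work before stopping the clock
        durations.append(time.perf_counter() - start)
    return maa, durations


def check_cupy(nloop=10, mc=5000):
    """Hypothetical helper: time each iteration of the question's loop on the GPU."""
    import cupy as cp  # imported here so timed_dots also works without CuPy

    maa = cp.zeros((mc, mc))
    mbb = cp.zeros((mc, mc))
    _, durations = timed_dots(
        cp, maa, mbb, nloop, sync=cp.cuda.Stream.null.synchronize
    )
    for jj, seconds in enumerate(durations):
        print("iteration %2d took %.3f s" % (jj, seconds))
```

Call check_cupy() on the machine that shows the error. Like the sleep, this is only a way to diagnose the problem, not a fix for TDR.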
Upvotes: 2