Reputation: 41
If I run a long-running kernel on a GPU device, after 2 seconds (by default) the windows TDR (Timeout Detection and Recovery) will kill the running kernels. I understand it, but what if you can't predict how long the kernel will run, because you need to do lots of computations and neither you know the capacity/speed of the underlying GPU for the actual user, who runs your program?
What are the best practices for solving this problem?
I found 3 ways to prevent it to happen, but none of those seems a good solution for me:
You need to make sure that your kernels are not too time-consuming: The kernel is time consuming and though I could do some kind of fragmentation and not run 1 million of them but 2*500k or 4*250k, but I still can't predict if it will fit into the default 2 seconds on the actual user's GPU. (I had the idea to half the number until your kernel won't drop a CL_INVALID_COMMAND_QUEUE error, and then you just call it multiple times with the smaller amount, but to be honest it sounds really hackie and have some other drawbacks.)
You can turn-off the watchdog timer (or increase the delay): Timeout Detection and Recovery of GPUs: It's done by registry edit, and you need to restart Windows to make it effective. You can't do it on a user's machine.
You can run the kernel on a GPU that is not hooked up to a display: How can you make sure the GPU is not hooked up to a display on a users machine? Even in my laptop my primary GPU is the Intel HD4000 and the NVidia GPU is not in use for display (I think so), but TDR still kills my kernels.
Upvotes: 1
Views: 2679
Reputation: 6343
You listed all of the solutions I know of. Since solution 2 leaves the machine in an unusable state while your kernel runs (not a good practice) it should be avoided. Since adding another GPU (solution 3) is not practical for you, your best bet is to focus on solution 1. I don't know why you are trying to maximize the work size to run as long as possible to avoid TDR. You should instead target around 10 ms or less (if you run many kernels that take longer the GUI is very sluggish). So instead of 4*250000, think more like 400*2500. You may need to put in some clFinish calls between each one (or batch of 10, or whatever). Keeping the execution time small (10 ms) and not overfilling the queue will allow the GPU to do other things in between kernels and you won't get TDR resets nor make the machine unusable and yet the GPU will be quite busy.
Upvotes: 2