Reputation: 706
I’m having a problem writing some C++ AMP code. I have included a sample. It runs fine on emulated accelerators but crashes the display driver on my hardware (Windows 7, NVIDIA GeForce GTX 660, latest drivers), and I can see nothing wrong with my code.
Is there a problem with my code, or is this a hardware/driver/compiler issue?
#include "stdafx.h"
#include <vector>
#include <iostream>
#include <amp.h>

int _tmain(int argc, _TCHAR* argv[])
{
    // Prints "NVIDIA GeForce GTX 660"
    concurrency::accelerator_view target_view = concurrency::accelerator().create_view();
    std::wcout << target_view.accelerator.description << std::endl;

    // lower numbers do not cause the issue
    const int x = 2000;
    const int y = 30000;

    // 1d array for storing result
    std::vector<unsigned int> resultVector(y);
    concurrency::array_view<unsigned int, 1> resultsArrayView(resultVector.size(), resultVector);

    // 2d array for data for processing
    std::vector<unsigned int> dataVector(x * y);
    concurrency::array_view<unsigned int, 2> dataArrayView(y, x, dataVector);

    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        resultsArrayView.extent,
        // Define the code to run on each thread on the accelerator.
        [=](concurrency::index<1> idx) restrict(amp)
        {
            concurrency::array_view<unsigned int, 1> buffer = dataArrayView[idx[0]];
            unsigned int bufferSize = buffer.get_extent().size();

            // needs both loops to cause crash
            for (unsigned int outer = 0; outer < bufferSize; outer++)
            {
                for (unsigned int i = 0; i < bufferSize; i++)
                {
                    // works without this line, also if I change to buffer[0] it works?
                    dataArrayView[idx[0]][0] = 0;
                }
            }

            // works without this line
            resultsArrayView[0] = 0;
        });

    std::cout << "crash on next line" << std::endl;
    resultsArrayView.synchronize();
    std::cout << "will never reach me" << std::endl;

    system("PAUSE");
    return 0;
}
Upvotes: 3
Views: 776
Reputation: 646
It is very likely that your computation exceeds the permitted quantum time (2 seconds by default). When that happens, the operating system steps in and forcibly resets the GPU; this is called Timeout Detection and Recovery (TDR). The software adapter (reference device) does not have TDR enabled, which is why the computation can exceed the permitted quantum time there.
Does your computation really require 30000 threads (variable y), each performing 2000 * 2000 (x * x) loop iterations? You can chunk your computation so that each chunk takes less than 2 seconds to compute. You can also consider disabling TDR or extending the permitted quantum time to fit your needs.
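To make the chunking idea concrete, here is a minimal host-side sketch in plain C++ (no amp.h needed). It only shows the bookkeeping: splitting the rows into ranges and estimating how much work one launch would do. The names RowChunk, split_rows, iterations_per_launch and the rows_per_chunk value are hypothetical and not part of C++ AMP; a suitable chunk size has to be found by measuring on the actual GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// A contiguous range of rows to be processed by one kernel launch.
struct RowChunk {
    std::size_t first_row;  // index of the first row in this chunk
    std::size_t row_count;  // number of rows in this chunk
};

// Split total_rows into consecutive chunks of at most rows_per_chunk rows.
// rows_per_chunk is a tuning knob (an assumption of this sketch): pick it so
// that one launch finishes well inside the TDR limit on the target GPU.
std::vector<RowChunk> split_rows(std::size_t total_rows, std::size_t rows_per_chunk)
{
    assert(rows_per_chunk > 0);
    std::vector<RowChunk> chunks;
    for (std::size_t first = 0; first < total_rows; first += rows_per_chunk) {
        chunks.push_back({first, std::min(rows_per_chunk, total_rows - first)});
    }
    return chunks;
}

// Inner-loop iterations executed by one launch that covers row_count rows,
// when every row runs a nested loop of row_length * row_length iterations
// (as in the question's kernel).
unsigned long long iterations_per_launch(std::size_t row_count, std::size_t row_length)
{
    return static_cast<unsigned long long>(row_count) * row_length * row_length;
}
```

With the constants from the question (30000 rows of length 2000), a single launch covers 30000 * 2000 * 2000 inner-loop iterations, while split_rows(30000, 4096) yields 8 smaller ranges. Each range would then presumably become its own parallel_for_each over an extent of row_count threads, with first_row added to idx[0] inside the kernel.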
I highly recommend reading a blog post on how to handle TDRs in C++ AMP, which explains TDR in detail: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/07/handling-tdrs-in-c-amp.aspx
Additionally, here is a separate blog post on how to disable TDR on Windows 8: http://blogs.msdn.com/b/nativeconcurrency/archive/2012/03/06/disabling-tdr-on-windows-8-for-your-c-amp-algorithms.aspx
Upvotes: 8