Reputation: 155
When I place a kernels directive on the top loop, why can't I use these two directives?
#pragma acc update device(hbias[0:n_hidden], W[0:n_hidden][0:n_visible])
#pragma acc update device(vbias[0:n_visible])
I need to update the variables hbias, vbias, and W in the code below, but it doesn't work:
void RBM::contrastive_divergence(int train_X[6][6], double learning_rate, int k) {
    double r = rand() / (RAND_MAX + 1.0);
    int *input = new int[n_visible];
    double *ph_mean = new double[n_hidden];
    int *ph_sample = new int[n_hidden];
    double *nv_means = new double[n_visible];
    int *nv_samples = new int[n_visible];
    double *nh_means = new double[n_hidden];
    int *nh_samples = new int[n_hidden];

    #pragma acc kernels
    for (int i = 0; i < train_N; i++) {
        for (int j = 0; j < n_visible; j++) {
            input[j] = train_X[i][j];
        }
        sample_h_given_v(input, ph_mean, ph_sample, r);
        for (int step = 0; step < k; step++) {
            if (step == 0) {
                gibbs_hvh(ph_sample, nv_means, nv_samples, nh_means, nh_samples, r);
            }
            else {
                gibbs_hvh(nh_samples, nv_means, nv_samples, nh_means, nh_samples, r);
            }
        }
        for (int i = 0; i < n_hidden; i++) {
            for (int j = 0; j < n_visible; j++) {
                W[i][j] += learning_rate * (ph_mean[i] * input[j] - nh_means[i] * nv_samples[j]) / N;
            }
            hbias[i] += learning_rate * (ph_sample[i] - nh_means[i]) / N;
        }
        // this directive
        #pragma acc update device(hbias[0:n_hidden], W[0:n_hidden][0:n_visible])
        for (int i = 0; i < n_visible; i++) {
            vbias[i] += learning_rate * (input[i] - nv_samples[i]) / N;
        }
        // and this directive
        #pragma acc update device(vbias[0:n_visible])
    }
    delete[] input;
    delete[] ph_mean;
    delete[] ph_sample;
    delete[] nv_means;
    delete[] nv_samples;
    delete[] nh_means;
    delete[] nh_samples;
}
But when I have several separate kernels, one on each nested loop, I can update the variables:
void RBM::contrastive_divergence(int train_X[6][6], double learning_rate, int k) {
    double r = rand() / (RAND_MAX + 1.0);
    int *input = new int[n_visible];
    double *ph_mean = new double[n_hidden];
    int *ph_sample = new int[n_hidden];
    double *nv_means = new double[n_visible];
    int *nv_samples = new int[n_visible];
    double *nh_means = new double[n_hidden];
    int *nh_samples = new int[n_hidden];

    for (int i = 0; i < train_N; i++) {
        #pragma acc kernels
        for (int j = 0; j < n_visible; j++) {
            input[j] = train_X[i][j];
        }
        sample_h_given_v(input, ph_mean, ph_sample, r);
        #pragma acc kernels
        for (int step = 0; step < k; step++) {
            if (step == 0) {
                gibbs_hvh(ph_sample, nv_means, nv_samples, nh_means, nh_samples, r);
            }
            else {
                gibbs_hvh(nh_samples, nv_means, nv_samples, nh_means, nh_samples, r);
            }
        }
        #pragma acc kernels
        {
            for (int i = 0; i < n_hidden; i++) {
                for (int j = 0; j < n_visible; j++) {
                    W[i][j] += learning_rate * (ph_mean[i] * input[j] - nh_means[i] * nv_samples[j]) / N;
                }
                hbias[i] += learning_rate * (ph_sample[i] - nh_means[i]) / N;
            }
            // this directive
            #pragma acc update device(hbias[0:n_hidden], W[0:n_hidden][0:n_visible])
        }
        #pragma acc kernels
        {
            for (int i = 0; i < n_visible; i++) {
                vbias[i] += learning_rate * (input[i] - nv_samples[i]) / N;
            }
            // and this directive
            #pragma acc update device(vbias[0:n_visible])
        }
    }
    delete[] input;
    delete[] ph_mean;
    delete[] ph_sample;
    delete[] nv_means;
    delete[] nv_samples;
    delete[] nh_means;
    delete[] nh_samples;
}
Upvotes: 2
Views: 715
Reputation: 5646
"Update" directives can only be used in host code since data movement must be initiated from the host. You can't have them within a compute region.
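To illustrate the legal placement, here's a minimal sketch (the function name, the scale factor, and the `+1.0` step are illustrative, not from your code, and it assumes the array is already present on the device via managed memory or an enclosing data region): the update directive sits in host code between compute regions, never inside one.

```cpp
#include <cassert>
#include <vector>

// Sketch: host modifies the array, the host-issued update pushes it to
// the device, and a separate compute region then uses the fresh copy.
void scale_then_offset(std::vector<double>& hbias, double factor) {
    double* p = hbias.data();
    int n = static_cast<int>(hbias.size());

    // Host-side modification of the array.
    for (int i = 0; i < n; ++i)
        p[i] *= factor;

    // Legal: issued from host code, so the runtime can initiate the copy.
    #pragma acc update device(p[0:n])

    // A separate compute region works on the refreshed device data.
    #pragma acc parallel loop copy(p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] += 1.0;
}
```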
There are many issues with this code. First, it's probably poor practice to use the same index variable, "i" in this case, for nested loops. Although scoping rules allow it, it makes it difficult to tell which "i" the code is supposed to be using.
The outer "i" loop is probably not safe to parallelize, so you shouldn't put the "kernels" directive outside this loop. It might work if you privatized the "input" array and then used atomics when updating the vbias, hbias, and W arrays, but your performance would be poor. (You'd also need to determine whether the other arrays need to be privatized, or are global and so need atomic operations.)
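A hypothetical sketch of that privatize-plus-atomics idea (the function, the 0.1 weight, and the fixed size of 4 are made up for illustration): "input" is declared inside the loop body, so each outer iteration gets its own copy, and the writes to the shared accumulator are made atomic. This is correct but slow, since the atomics serialize the updates.

```cpp
#include <cassert>
#include <cmath>

// Sketch: private per-iteration scratch array plus atomic updates to a
// shared array. Assumes cols <= 4 so the scratch buffer fits.
void accumulate(const int (*train_X)[4], double* W, int rows, int cols) {
    #pragma acc parallel loop copyin(train_X[0:rows][0:cols]) copy(W[0:cols])
    for (int i = 0; i < rows; ++i) {
        int input[4];                   // private: local to the iteration
        for (int j = 0; j < cols; ++j)
            input[j] = train_X[i][j];
        for (int j = 0; j < cols; ++j) {
            #pragma acc atomic update
            W[j] += 0.1 * input[j];     // shared: atomic avoids the race
        }
    }
}
```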
What I'd suggest is to start by putting "#pragma acc parallel loop" around the inner loops, one at a time. Make sure each works before moving on to the next one. Also, I highly doubt the "step" loop is parallelizable, so you'll most likely need to parallelize the loops inside the "gibbs_hvh" subroutine instead.
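For example, taking just the vbias loop on its own (parameters here are illustrative stand-ins for the class members), the incremental approach looks like this; verify it works before touching the next loop:

```cpp
#include <cassert>

// Sketch: one inner loop, given its own parallel loop with explicit
// data clauses, independent of the rest of the training step.
void update_vbias(double* vbias, const int* input, const int* nv_samples,
                  int n_visible, double learning_rate, int N) {
    #pragma acc parallel loop copy(vbias[0:n_visible]) \
            copyin(input[0:n_visible], nv_samples[0:n_visible])
    for (int i = 0; i < n_visible; ++i)
        vbias[i] += learning_rate * (input[i] - nv_samples[i]) / N;
}
```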
Since you're using CUDA Unified Memory (-ta=tesla:managed), adding data regions probably isn't necessary. However, if you plan to stop using managed memory in the future, the next step would be to add data directives around the outer "i" loop (or at a higher point in the program, and then use update directives to synchronize data after the outer "i" loop).
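A hypothetical sketch of that non-managed layout (the function name and the 0.5 increment are illustrative): a data region spans the training loop, and a host-issued "update self" pulls results back whenever the host needs them.

```cpp
#include <cassert>

// Sketch: data region around the loop; compute regions assert presence,
// and updates are issued from host code inside the data region.
void train(double* vbias, int n_visible, int epochs) {
    #pragma acc data copy(vbias[0:n_visible])
    {
        for (int e = 0; e < epochs; ++e) {
            #pragma acc parallel loop present(vbias[0:n_visible])
            for (int i = 0; i < n_visible; ++i)
                vbias[i] += 0.5;

            // Host code inside the data region: a legal place for update.
            #pragma acc update self(vbias[0:n_visible])
        }
    }
}
```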
Upvotes: 2