piyumi_rameshka

Reputation: 350

OpenMP offloaded target region executed on both host and target device

I'm working on a project that requires OpenMP offloading to Nvidia GPUs using Clang. I was able to install Clang with offloading support by following the instructions mentioned here.

System specification

The problem is that when I run a sample program to test OpenMP offloading to Nvidia GPUs, part of the target region runs on the GPU and then the same target region starts executing on the host.

The sample program is below. It is a small C program that multiplies two matrices.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Problem size. */
# define N 1920

void init_array(float* A, float* B)
{
    int i, j;
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            A[i*N + j] = ((float) i*j) / N;
        }
    }

    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
        {
            B[i*N + j] = ((float) i*(j+1)) / N;
        }
    }
}
void mm_kernel(float *A, float *B, float *E)
{
    #pragma omp target data map(to:A) map(to:B) map(alloc:E)
    {
        #pragma omp target
        #pragma omp teams distribute num_teams(4)
        for (int i = 0; i < N; i++)
        {
            printf("Team %d Thread %d Number of threads %d \n", omp_get_team_num(), omp_get_thread_num(), omp_get_num_threads());
            #pragma omp parallel for
            for (int j = 0; j < N; j++)
            {
                E[i*N + j] = 0.0;
                for (int k = 0; k < N; k++)
                {
                    E[i*N + j] = E[i*N + j] + A[i*N + k] * B[j*N+k];
                }
            }
        }
    }
}

int main(){
  double t_start, t_end;

    float* A;
    float* B;
    float* E;

    A = (float*)malloc(N*N*sizeof(float));
    B = (float*)malloc(N*N*sizeof(float));
    E = (float*)malloc(N*N*sizeof(float));
    init_array(A, B); //initialize Matrix A and B

    t_start = omp_get_wtime();
    mm_kernel(A,B,E);
    t_end = omp_get_wtime();

    printf("Time spent %lf\n",t_end-t_start );
    free(A);
    free(B);
    free(E);
}

The program was compiled using

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda 3mm.c -o 3mmgpu

The main reason I believe the target region is executing on both the host and the target device is the output on the command line.

command line output

At first, team 0 and team 1 each report 960 threads, and later iterations report 2 threads per team. (My processor has 4 cores, each capable of running 2 hardware threads.)

I also ran the fat binary under nvprof to verify whether anything is being executed on the GPU.

The profiling results are as follows.

profiling result

I cannot understand what is happening in the target region. Why is the target region being executed on both the host and the target device?
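One way to check where a target region actually ran, rather than inferring it from thread counts, is to call omp_is_initial_device() inside the region. The following is a minimal sketch, not part of the original program; the helper name target_ran_on_host is hypothetical. It assumes that a result of 1 means the region ran on the host (the initial device) and 0 means it ran on an offload device.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper (not from the original program).
   Returns 1 if the target region ran on the host, 0 if it ran on a device. */
int target_ran_on_host(void)
{
    int on_host = 1; /* default: host, also used when OpenMP is disabled */

    /* A scalar is firstprivate in a target region by default,
       so it must be mapped explicitly to get the value back. */
    #pragma omp target map(tofrom: on_host)
    {
#ifdef _OPENMP
        on_host = omp_is_initial_device();
#endif
    }
    return on_host;
}
```

Calling this once before the real kernel and printing the result shows whether offloading works at all. Where the runtime supports the OpenMP 5.0 environment variable OMP_TARGET_OFFLOAD, setting it to MANDATORY should make a failed offload abort the program instead of falling back to the host.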

Upvotes: 1

Views: 926

Answers (1)

piyumi_rameshka

Reputation: 350

I'm posting an answer to my own question, as I was finally able to figure out what went wrong in the code. The offloaded region crashed on the target device because I had mapped the data to the GPU incorrectly. Clauses such as map(to:A) map only the pointers themselves, so no memory for the matrices was allocated on the GPU. Because the GPU execution crashed, execution continued on the host.
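The corrected code is not shown here, so the following is only a sketch of one possible fix, assuming the matrices are n x n and stored row-major as in the question. The function name mm_kernel_mapped and the parameter n are hypothetical. It maps the pointed-to data with array sections (A[0:len]) instead of bare pointers, and it uses map(from:) for E on the assumption that the result is needed back on the host. It also swaps the nested teams/parallel constructs for a single combined construct, which is a simplification rather than the original structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical corrected kernel: E[i][j] = sum_k A[i][k] * B[j][k],
   the same indexing as the loop in the question. */
void mm_kernel_mapped(const float *A, const float *B, float *E, int n)
{
    assert(n > 0);
    size_t len = (size_t)n * (size_t)n; /* number of elements per matrix */

    /* Array sections map the data itself, not just the pointer values. */
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: A[0:len], B[0:len]) map(from: E[0:len])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += A[(size_t)i * n + k] * B[(size_t)j * n + k];
            }
            E[(size_t)i * n + j] = sum;
        }
    }
}
```

With the program from the question, the call would become mm_kernel_mapped(A, B, E, N). Accumulating into a local sum instead of E[i*N + j] also avoids repeated reads and writes of device global memory inside the innermost loop.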

Thank you @Alexey Bataev for pointing that out.

Upvotes: 2
