user3509540
user3509540

Reputation: 41

OpenCL Matrix Multiplication Altera Example

I am very new to OpenCL and am going through the Altera OpenCL examples. In their matrix multiplication example, they have used the concept of blocks, where dimensions of the input matrices are multiple of block size. Here's the code:

void matrixMult( // Input and output matrices
        __global float *restrict C,
        __global float *A,
        __global float *B, 
        // Widths of matrices.
        int A_width, int B_width)
{
    // Local storage for a block of input matrices A and B
    __local float A_local[BLOCK_SIZE][BLOCK_SIZE];
    __local float B_local[BLOCK_SIZE][BLOCK_SIZE];

    // Block index
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);

    // Local ID index (offset within a block)
    int local_x = get_local_id(0);
    int local_y = get_local_id(1);

    // Compute loop bounds
    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end   = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;

    float running_sum = 0.0f;
    for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
    {
        A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
        B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            running_sum += A_local[local_y][k] * B_local[local_x][k];
        }
    }

    // Store result in matrix C
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}

Assume block size is 2, then: block_x and block_y are both 0; and local_x and local_y are both 0.
Then A_local[0][0] would be A[0] and B_local[0][0] would be B[0].
Sizes of A_local and B_local are 4 elements each.

In that case, how would A_local and B_local access other elements of the block in that iteration?
Also would separate threads/cores be assigned for each local_x and local_y?

Upvotes: 2

Views: 775

Answers (2)

mfa
mfa

Reputation: 5087

There is definitely a barrier missing in your code sample. The outer for loop as you have it will only produce correct results if all work items are executing instructions in lockstep fashion, thus guaranteeing the local memory is populated before the for k loop.

Maybe this is the case for Altera and other FPGAs, but this is not correct for CPUs and GPUs.

You should add barrier(CLK_LOCAL_MEM_FENCE); if you are getting unexpected results, or want to be compatible with other type of hardware.

float running_sum = 0.0f;
for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
{
    A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
    B_local[local_x][local_y] = B[b + B_width * local_y + local_x];

    barrier(CLK_LOCAL_MEM_FENCE);

    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        running_sum += A_local[local_y][k] * B_local[local_x][k];
    }
}

Upvotes: 1

Gilles
Gilles

Reputation: 9489

A_local and B_local are both shared by all work items of the work group, so all their elements are loaded in parallel (by all work items of the work group) at each step of the encompassing for loop.

Then each work item uses some of the loaded values (not necessarily the values the work item loaded itself) to do its share of the computation.

And finally, the work item stores its individual result into the global output matrix.

It is a classical tiled implementation of a matrix-matrix multiplication. However, I'm really surprised not to see any sort of call to a memory synchronisation function, such as work_group_barrier(CLK_LOCAL_MEM_FENCE) between the load of A_local and B_local and their use in the k loop... But I might very well have overlooked something here.

Upvotes: 1

Related Questions