Reputation: 16978
I understand the theory of matrix multiplication; I just have two questions about this particular kernel implementation.
For reference, num_rows = 32. The matrix B (b_mat) has already been transposed by another kernel, so as I understand it we're taking dot products of row vectors.
1) Why do we need the variable "vectors_per_row", and thus the inner loop? I thought we could just do sum += dot(row of A, row of B), but this variable seems to split each row into smaller portions (why?).
2) I don't understand the address offsets for a_mat and c_mat, i.e. a_mat += start; c_mat += start*4;
    __kernel void matrix_mult(__global float4 *a_mat,
                              __global float4 *b_mat, __global float *c_mat) {
        float sum;
        int num_rows = get_global_size(0);
        int vectors_per_row = num_rows/4;
        int start = get_global_id(0) * vectors_per_row;
        a_mat += start;
        c_mat += start*4;
        for(int i=0; i<num_rows; i++) {
            sum = 0.0f;
            for(int j=0; j<vectors_per_row; j++) {
                sum += dot(a_mat[j],
                           b_mat[i*vectors_per_row + j]);
            }
            c_mat[i] = sum;
        }
    }
Upvotes: 0
Views: 880
Reputation: 28302
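Your matrices are stored as arrays of float4s, and a float4 is a vector of 4 floats. That is where the 4 comes from: a row of num_rows floats occupies num_rows/4 float4s (8 when num_rows = 32). dot only accepts the built-in scalar and vector types, which have at most 4 components, so it cannot take a whole row at once. The inner loop instead sums the dot products of the row's float4 chunks.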
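To make the chunking concrete, here is a minimal host-side sketch in plain C (not OpenCL) of how one row-by-row dot product breaks into 4-wide pieces. The names vec4, dot4 and row_dot are hypothetical stand-ins for float4, dot and the kernel's inner loop; they are not part of any OpenCL API.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

/* Stand-in for OpenCL's dot() applied to two float4 values. */
static float dot4(vec4 a, vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* Dot product of two rows, each stored as vectors_per_row vec4 chunks.
   This mirrors the kernel's inner loop over j. */
static float row_dot(const vec4 *a_row, const vec4 *b_row,
                     size_t vectors_per_row) {
    float sum = 0.0f;
    for (size_t j = 0; j < vectors_per_row; j++) {
        sum += dot4(a_row[j], b_row[j]);
    }
    return sum;
}
```

With num_rows = 32 you would call row_dot(a_row, b_row, 8): eight 4-wide dot products added together give the full 32-element dot product.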
c_mat is a pointer to float, while a_mat is a pointer to float4, which is why c_mat advances by start*4 and a_mat by start: both move to the same row, just counted in different units. The offset exists because the work is split across several (potentially hundreds of) threads, and each thread computes only a small part of the multiplication, here one row of c_mat. start is simply where a thread starts computing. This is what get_global_id(0) is for: it essentially returns your thread id. Technically it's the thread's index in the first dimension, but it appears you only have one thread dimension, so here you can just think of it as the thread id.
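As a rough illustration of the offsets, the sketch below emulates the kernel on the host in plain C, using float pointers everywhere so that every offset is counted in floats. The functions matrix_mult_item and matrix_mult_all are hypothetical names for this example. It assumes square n x n row-major matrices with n divisible by 4 and B already transposed, as in the question.

```c
#include <stddef.h>

/* Emulates one work-item of the kernel on the host.
   gid plays the role of get_global_id(0), n of get_global_size(0).
   b_t is B already transposed; all matrices are n x n, row-major. */
static void matrix_mult_item(size_t gid, size_t n,
                             const float *a, const float *b_t, float *c) {
    size_t vectors_per_row = n / 4;        /* float4 chunks per row */
    size_t start = gid * vectors_per_row;  /* row offset in float4 units */
    const float *a_row = a + start * 4;    /* same address as (float4 *)a + start */
    float *c_row = c + start * 4;          /* c is float, hence start*4 */

    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t k = 0; k < n; k++) {
            /* row gid of A times row i of the transposed B */
            sum += a_row[k] * b_t[i * n + k];
        }
        c_row[i] = sum;
    }
}

/* Runs every work-item in sequence; a real device runs them in parallel. */
static void matrix_mult_all(size_t n, const float *a,
                            const float *b_t, float *c) {
    for (size_t gid = 0; gid < n; gid++) {
        matrix_mult_item(gid, n, a, b_t, c);
    }
}
```

Because start * 4 equals gid * n, work-item gid reads row gid of A and writes row gid of C. The float4 pointer in the real kernel reaches the same address with the smaller offset start.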
Upvotes: 2