user18490

Reputation: 3819

Clang built-in matrix and vector extension: efficient matrix-vector multiplication

I am writing a small 3D graphics app to learn about Clang's vector and matrix extensions (the matrix extension still seems to be under development, if I am reading the right version of the docs).

I am unsure how to write the most efficient code for a matrix-vector multiplication using these types. Using:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));

The doc says (regarding the indices to access the elements of a matrix):

The first specifies the number of rows, and the second specifies the number of columns.

     Column
        |
        v
Row->| M00 M01 M02 M03 |
     | M10 M11 M12 M13 |
     | M20 M21 M22 X23 |
     | M30 M31 M32 M33 |

So I get that m[2][3] (where m is an m4x4) would give me the element that I marked X in the matrix above.

Then (regarding the way the elements are laid out in memory):

The elements of a value of a matrix type are laid out in column-major order without padding.

So I get from this note that if I could look at the way the elements are stored in memory I would get:

M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33 
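
As a sanity check of this reading, here is a minimal sketch in plain C of how I understand the offset computation: element (row, col) of a column-major matrix with `rows` rows would sit at flat offset col * rows + row. The helper names `col_major_offset` and `at4x4` are made up for illustration, and a plain float array stands in for the matrix type.

```c
#include <stddef.h>

/* Hypothetical helper: flat offset of element (row, col) in a
   column-major matrix with `rows` rows and no padding. */
size_t col_major_offset(size_t row, size_t col, size_t rows) {
    return col * rows + row;
}

/* Hypothetical helper: read element (row, col) of a 4x4 matrix
   stored column-major in a plain array of 16 floats. */
float at4x4(const float *data, size_t row, size_t col) {
    return data[col_major_offset(row, col, 4)];
}
```

With this, element (2, 3) of a 4x4 matrix lands at offset 14, which matches the position of X23 in the listing above.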

Do I get it right so far?

Does the order in which we access the elements of the matrix matter? (and am I doing it right?)

Then I assume that, to be efficient in my mat-float4 multiplication, I'd need to access the elements in the order they are laid out in memory, so:

m4x4 m;
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
    v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
    v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
    v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
    1 // ignore w element for now
};

Of course it's up to me to load the right values in m[0][0], m[0][1], ... using something like __builtin_matrix_column_major_load.

Am I over-complicating things, or does the order matter here? Is the code above actually better than:

float4 res = {
    v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
    v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
    v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
    1 // ignore w element for now
};

(assuming I have transposed the elements before calling __builtin_matrix_column_major_load).

Is there a better way of doing it?

Now, I understand these types are still being developed. Yet I understand that the whole point of these types is to take advantage of SIMD instructions. If I do:

float4 a = {...};
float4 b = {...};
float4 c = a + b;

Then is adding the 4 floats of a to the respective 4 floats of b done with a single SIMD instruction? And concerning the mat-float4 multiplication: because I access the elements of the float4 and m4x4 individually in my code, it seems that I wouldn't be taking advantage of any such optimization in this particular case?

So my second question: is there a better way of doing this?

Any feedback or advice would be greatly appreciated. I am doing this as an exercise to learn about these new built-in types.

Upvotes: 2

Views: 756

Answers (1)

bockyboh

Reputation: 31

In case anyone finds this now:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float float4x4 __attribute__((matrix_type(4, 4)));

float4 mulmv4(float4x4 mat, float4 vec) {
    typedef float float4x1 __attribute__((matrix_type(4, 1)));
    float4 dst;
    float4x1 col = __builtin_matrix_column_major_load((float *)&vec, 4, 1, 4);
    __builtin_matrix_column_major_store(mat * col, (float *)&dst, 4);
    return dst;
}

Load the vector as a 4x1 column "matrix" and the product mat * col is defined (the matrix extension has to be enabled with -fenable-matrix). This really should be built-in, although, like you said, Clang matrix_types are WIP.
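
If you would rather not depend on the matrix extension, a possible alternative (a sketch, not something taken from the Clang docs) is to keep the matrix as four column vectors and compute M * v as a weighted sum of its columns. The `cols4x4` struct and the `mul_by_columns` name are hypothetical, and the `vector_size` fallback is only there so the sketch also builds with GCC:

```c
#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
/* Fallback so the sketch also builds with GCC. */
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical layout: the matrix kept as four column vectors,
   i.e. the same column-major order the matrix type uses. */
typedef struct { float4 col[4]; } cols4x4;

/* M * v as a weighted sum of columns: each term is a
   vector-by-scalar multiply and each sum is a vector add. */
float4 mul_by_columns(cols4x4 m, float4 v) {
    return m.col[0] * v[0] + m.col[1] * v[1]
         + m.col[2] * v[2] + m.col[3] * v[3];
}
```

Written this way, no matrix element is read individually, so the compiler has a good chance of emitting vector multiplies and adds, though the generated code depends on the target and the optimization level.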

BTW: You can apply the same concept to the dot product of ext_vector_types since (AFAIK) that isn't built-in either. Dot would be multiplying a float1x4 by a float4x1 (in that order).
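
For the dot product, a simpler alternative to the 1x4 by 4x1 matrix product is an element-wise multiply followed by a sum of the lanes. This is a minimal sketch with a hypothetical `dot4` helper, not a built-in; the `vector_size` fallback is only there so it also builds with GCC:

```c
#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
/* Fallback so the sketch also builds with GCC. */
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical helper: dot product of two 4-float vectors. */
float dot4(float4 a, float4 b) {
    float4 p = a * b;                  /* element-wise product */
    return p[0] + p[1] + p[2] + p[3];  /* sum of the four lanes */
}
```

This does not need the matrix extension at all, only the vector type.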

Upvotes: 2
