Reputation: 3819
I am writing a small 3D graphics app to learn about Clang's vector and matrix extensions (the matrix extension still seems to be under development, if I am reading the right version of the docs).
I am unsure how to write the most efficient code for a matrix-vector multiplication using these types. Using:
typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));
The doc says (regarding the indices to access the elements of a matrix):
The first specifies the number of rows, and the second specifies the number of columns.
      Column
        |
        v
Row->| M00 M01 M02 M03 |
     | M10 M11 M12 M13 |
     | M20 M21 M22 X23 |
     | M30 M31 M32 M33 |
So I understand that m[2][3] (where m is an m4x4) gives me the element that I marked X23 in the matrix above.
Then (regarding the way the elements are laid out in memory):
The elements of a value of a matrix type are laid out in column-major order without padding.
So I get from this note that if I could look at the way the elements are stored in memory I would get:
M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33
Do I get it right so far?
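As a quick sanity check of that reading, the index arithmetic implied by a column-major layout without padding can be sketched in plain C. The helper name column_major_offset is hypothetical and is not part of the Clang extension; it only mirrors what the documentation describes.

```c
#include <stddef.h>

/* Hypothetical helper (not a Clang builtin): offset, in elements, of
   element (row, col) in a column-major matrix with `rows` rows and
   no padding between columns. */
static size_t column_major_offset(size_t row, size_t col, size_t rows) {
    return col * rows + row;
}
```

For the 4x4 case, m[2][3] (the X23 element) lands at offset 3 * 4 + 2 = 14, which is the second-to-last float in the listing above.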
Then I assume that, to be efficient in my mat-float4 multiplication, I need to access the elements in the order they are laid out in memory, like this:
m4x4 m;
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
1 // ignore w element for now
};
Of course, it's up to me to load the right values into m[0][0], m[0][1], ... using something like __builtin_matrix_column_major_load.
Am I overcomplicating things, or does the order matter here? Is the code above actually better than:
float4 res = {
v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
1 // ignore w element for now
};
(assuming I have transposed the elements before calling __builtin_matrix_column_major_load)?
I understand these types are still being developed. Yet I also understand that the whole point of these types is to take advantage of SIMD instructions. If I do:
float4 a = {...};
float4 b = {...};
float4 c = a + b;
Then adding the 4 floats of a to the respective 4 floats of b happens in a single cycle? So, concerning the mat-float4 multiplication: because I access the elements of the float4 and m4x4 individually in my code, it seems that I wouldn't be taking advantage of any optimization in this particular case?
So my second question: is there a better way of doing this? For example, using __m128 values and using those to get the matrix elements multiplied by the vector's elements with additional SIMD intrinsics such as _mm_add_ps and _mm_mul_ps.
Any feedback or advice would be greatly appreciated. I am doing this as an exercise to learn about these new built-in types.
Upvotes: 2
Views: 756
Reputation: 31
In case anyone finds this now:
typedef float float4 __attribute__((ext_vector_type(4)));
typedef float float4x4 __attribute__((matrix_type(4, 4)));
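// Matrix types need the -fenable-matrix flag when compiling with Clang.
float4 mulmv4(float4x4 mat, float4 vec) {
    typedef float float4x1 __attribute__((matrix_type(4, 1)));
    float4 dst;
    // Load the vector as a 4x1 column matrix (4 rows, 1 column, column stride 4).
    float4x1 col = __builtin_matrix_column_major_load((float *)&vec, 4, 1, 4);
    // 4x4 * 4x1 yields a 4x1 matrix; store it back into the float4.
    __builtin_matrix_column_major_store(mat * col, (float *)&dst, 4);
    return dst;
}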
Cast the vector to a column "matrix" and the product is defined. This really should be built in, although, as you said, Clang matrix types are a work in progress.
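For comparison, here is a rough sketch of what that 4x4 by 4x1 product computes: a sum of the matrix columns, each scaled by one component of the vector, which walks the column-major data in memory order. To keep it compilable without -fenable-matrix, it deliberately swaps in the GCC-style vector_size attribute (accepted by both GCC and Clang) instead of ext_vector_type and matrix_type. The names vec4, splat4 and mulmv4_columns are made up for this illustration, and whether it is faster than the builtin version is something you would have to measure.

```c
#include <string.h>

/* GCC-style 4-float vector; an assumption made for portability,
   not the ext_vector_type used elsewhere in this thread. */
typedef float vec4 __attribute__((vector_size(16)));

/* Hypothetical helper: broadcast one scalar into all four lanes. */
static vec4 splat4(float s) {
    vec4 r = {s, s, s, s};
    return r;
}

/* cols points at 16 floats in column-major order, so element
   M[r][c] lives at cols[c * 4 + r]. Returns M * v for a column vector v. */
static vec4 mulmv4_columns(const float *cols, vec4 v) {
    vec4 acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int c = 0; c < 4; ++c) {
        vec4 col;
        memcpy(&col, cols + 4 * c, sizeof col); /* alignment-safe load of column c */
        acc += col * splat4(v[c]);              /* accumulate v[c] * column c */
    }
    return acc;
}
```

Each loop iteration is one vector multiply and one vector add over a contiguous column, so no individual matrix element is ever addressed from the calling code.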
BTW: You can apply the same concept to the dot product of ext_vector_types, since (AFAIK) that isn't built in either. The dot product would be a float1x4 multiplied by a float4x1 (in that order).
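Written out, that row-times-column product is just the usual dot product. The component names below are only illustrative labels for the lanes of two 4-float vectors a and b:

```latex
\begin{pmatrix} a_x & a_y & a_z & a_w \end{pmatrix}
\begin{pmatrix} b_x \\ b_y \\ b_z \\ b_w \end{pmatrix}
= a_x b_x + a_y b_y + a_z b_z + a_w b_w
```

The result is a 1x1 matrix, so you would still need to read out its single element to get a plain float.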
Upvotes: 2