Reputation: 41
I have a 4x4 matrix class:
class matrix {
public:
    matrix() {}
    matrix(float m11, float m21, float m31, float m41,
           float m12, float m22, float m32, float m42,
           float m13, float m23, float m33, float m43,
           float m14, float m24, float m34, float m44);
    matrix(const float*);
    matrix(const matrix&);
    matrix operator*(const matrix& other) const;
    static const matrix identity;
private:
    union {
        float m[16];
        struct {
            // column-major storage: m[0..3] holds the first column
            float m11, m21, m31, m41;
            float m12, m22, m32, m42;
            float m13, m23, m33, m43;
            float m14, m24, m34, m44;
        };
        struct {
            float element[4][4]; // element[i] is column i
        };
    };
};
Below is the first implementation of the multiplication operator:
matrix matrix::operator*(const matrix& other) const {
    return matrix(
        m11*other.m11 + m12*other.m21 + m13*other.m31 + m14*other.m41,
        m21*other.m11 + m22*other.m21 + m23*other.m31 + m24*other.m41,
        m31*other.m11 + m32*other.m21 + m33*other.m31 + m34*other.m41,
        m41*other.m11 + m42*other.m21 + m43*other.m31 + m44*other.m41,
        m11*other.m12 + m12*other.m22 + m13*other.m32 + m14*other.m42,
        m21*other.m12 + m22*other.m22 + m23*other.m32 + m24*other.m42,
        m31*other.m12 + m32*other.m22 + m33*other.m32 + m34*other.m42,
        m41*other.m12 + m42*other.m22 + m43*other.m32 + m44*other.m42,
        m11*other.m13 + m12*other.m23 + m13*other.m33 + m14*other.m43,
        m21*other.m13 + m22*other.m23 + m23*other.m33 + m24*other.m43,
        m31*other.m13 + m32*other.m23 + m33*other.m33 + m34*other.m43,
        m41*other.m13 + m42*other.m23 + m43*other.m33 + m44*other.m43,
        m11*other.m14 + m12*other.m24 + m13*other.m34 + m14*other.m44,
        m21*other.m14 + m22*other.m24 + m23*other.m34 + m24*other.m44,
        m31*other.m14 + m32*other.m24 + m33*other.m34 + m34*other.m44,
        m41*other.m14 + m42*other.m24 + m43*other.m34 + m44*other.m44
    );
}
I then tried to accelerate it with SSE intrinsics, using the version below:
#include <xmmintrin.h> // SSE intrinsics

matrix matrix::operator*(const matrix& other) const {
    float r[4][4];
    // load the four columns of *this
    __m128 c1 = _mm_loadu_ps(&m11);
    __m128 c2 = _mm_loadu_ps(&m12);
    __m128 c3 = _mm_loadu_ps(&m13);
    __m128 c4 = _mm_loadu_ps(&m14);
    for (int i = 0; i < 4; ++i) {
        // broadcast each element of column i of 'other'
        __m128 v1 = _mm_set1_ps(other.element[i][0]);
        __m128 v2 = _mm_set1_ps(other.element[i][1]);
        __m128 v3 = _mm_set1_ps(other.element[i][2]);
        __m128 v4 = _mm_set1_ps(other.element[i][3]);
        // column i of the result is a linear combination of this matrix's columns
        __m128 col = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(v1, c1), _mm_mul_ps(v2, c2)),
            _mm_add_ps(_mm_mul_ps(v3, c3), _mm_mul_ps(v4, c4)));
        _mm_storeu_ps(r[i], col);
    }
    return matrix(&r[0][0]);
}
But on my MacBook Pro, doing 100,000 matrix multiplications takes about 6 ms with the first version and about 8 ms with the second. I want to know why this happens. Is it perhaps because the CPU pipeline lets the first version run its computations concurrently, while the loads/stores hold back the second version?
Upvotes: 4
Views: 377
Reputation: 244823
You benefit from massive instruction parallelism in the first (scalar) case, when you allow the compiler to optimize the code as it sees best. By arranging the code so as to minimize data dependencies, even though that may result in more total instructions being required, each instruction can be run simultaneously on different execution units. There are lots of registers available, so most of the values can be kept enregistered, minimizing the need for costly memory reads, and even when memory reads are necessary, they can be done nearly for free while other operations are completing, thanks to out-of-order execution scheduling. I would further speculate that you are benefiting from μ-op caching here, which compensates for the increased code size.
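To make the dependency-chain idea concrete, here is a toy sketch (not your code; the function names are made up for illustration). Both loops perform the same number of multiply-adds, but the first is one long dependency chain, while the second exposes four independent chains that the CPU can keep in flight simultaneously on different execution units:

float chained(const float* a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s = s * 1.0001f + a[i];   // every iteration must wait for the previous s
    return s;
}

float interleaved(const float* a, int n) {
    // four independent accumulators: the CPU can overlap their latencies
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 = s0 * 1.0001f + a[i];
        s1 = s1 * 1.0001f + a[i + 1];
        s2 = s2 * 1.0001f + a[i + 2];
        s3 = s3 * 1.0001f + a[i + 3];
    }
    return s0 + s1 + s2 + s3;   // (results differ slightly due to reassociation)
}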
In the second (parallel) case, you're creating significant data dependencies. Even when the compiler emits optimal object code (and this isn't necessarily going to be the case when you use intrinsics), there is a cost involved in forcing this parallelism. You can see that if you ask the compiler to show you an assembly listing. There are tons of shufps instructions required to pack and reorder the floating-point operands within the SSE registers between operations. That only takes a single cycle on modern Intel architectures*, but the subsequent addps and mulps operations cannot execute in parallel. They have to wait for it to complete. Chances are very good that this code is hitting up against a hard μ-op throughput bottleneck. (You may also be paying an unaligned data penalty in this code, but that is minimal on modern architectures.)
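One variant you could try (just a sketch, untested and unbenchmarked; it reuses c1 through c4 and r from your function): load each column of other once and splat its lanes with explicit shufps, which replaces the four scalar loads per iteration with a single vector load. It still pays the shuffle dependencies described above, so measure before trusting it.

for (int i = 0; i < 4; ++i) {
    // one vector load for the whole column instead of four scalar loads
    __m128 o  = _mm_loadu_ps(other.element[i]);
    __m128 v1 = _mm_shuffle_ps(o, o, _MM_SHUFFLE(0, 0, 0, 0)); // splat lane 0
    __m128 v2 = _mm_shuffle_ps(o, o, _MM_SHUFFLE(1, 1, 1, 1)); // splat lane 1
    __m128 v3 = _mm_shuffle_ps(o, o, _MM_SHUFFLE(2, 2, 2, 2)); // splat lane 2
    __m128 v4 = _mm_shuffle_ps(o, o, _MM_SHUFFLE(3, 3, 3, 3)); // splat lane 3
    __m128 col = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(v1, c1), _mm_mul_ps(v2, c2)),
        _mm_add_ps(_mm_mul_ps(v3, c3), _mm_mul_ps(v4, c4)));
    _mm_storeu_ps(r[i], col);
}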
In other words, you've traded parallelism (at the expense of larger code) for increased data dependencies (albeit with smaller code). At least, that would be my semi-educated guess, looking at the disassembly for your example code. In this case, your benchmark tells you very clearly that it did not work out in your favor.
Things might change if you instructed the compiler to assume AVX support. If the target architecture does not support AVX, the compiler has no choice but to transform your _mm_set1_ps intrinsic into a pair of movss, shufps instructions. If you enable AVX support, you'll get a single vbroadcastss instruction instead, which may be faster, especially with AVX2 support, where you can broadcast from register-to-register (instead of only from memory-to-register). With AVX support, you also get the benefit of VEX-encoded instructions.
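For illustration, here is a sketch of what that might look like written out with AVX intrinsics (untested; requires compiling with -mavx or /arch:AVX, and including <immintrin.h> instead of <xmmintrin.h>). _mm_broadcast_ss maps to a single vbroadcastss from memory:

for (int i = 0; i < 4; ++i) {
    // each of these is one vbroadcastss instead of a movss + shufps pair
    __m128 v1 = _mm_broadcast_ss(&other.element[i][0]);
    __m128 v2 = _mm_broadcast_ss(&other.element[i][1]);
    __m128 v3 = _mm_broadcast_ss(&other.element[i][2]);
    __m128 v4 = _mm_broadcast_ss(&other.element[i][3]);
    __m128 col = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(v1, c1), _mm_mul_ps(v2, c2)),
        _mm_add_ps(_mm_mul_ps(v3, c3), _mm_mul_ps(v4, c4)));
    _mm_storeu_ps(r[i], col);
}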
* Although on certain older architectures like Core 2, shufps was an integer-based instruction, and therefore resulted in a delay when it was followed by a floating-point instruction like addps or mulps. I can't remember when exactly this was fixed, but certainly it is not a problem on Sandy Bridge and later.
Upvotes: 3