Reputation: 14583
I've got a little bit of code in my innermost loop that I'm using to clamp some error values for a rasterization algorithm I'm writing:
float cerror[4] = {
MINF(error[0], 1.0f),
MINF(error[1], 1.0f),
MINF(error[2], 1.0f),
MINF(error[3], 1.0f)
};
where MINF is just MINF(a,b) = ((a) < (b)) ? (a) : (b)
It turns out I've got 4 error values I have to update in this inner loop, all floats, so it'd be great if I could get them all stored in SSE registers and have the minimum computed with minps rather than separately, but the compiler doesn't seem to be doing that for me.
I even tried moving it to its own function so I can see the vectorizer output:
void fclamp4(float* __restrict__ aa, float* __restrict__ bb) {
for (size_t ii=0; ii < 4; ii++) {
aa[ii] = (bb[ii] > 1.0) ? 1.0f : bb[ii];
}
}
Which gives me something like:
inc/simplex.h:1508: note: not vectorized: unsupported data-type bool
inc/simplex.h:1507: note: vectorized 0 loops in function.
Is there a way to better encourage the compiler to do this for me? I'd rather not skip straight to intrinsics if I can avoid it, so the code remains portable. Is there perhaps a general reference with common patterns?
Lastly, all of my error/cerror/error increments are stored in float[4] arrays on the stack. Do I need to manually align those, or can the compiler handle that for me?
Edit: playing around with an aligned type and still no dice.
#include <stdio.h>
#include <stdlib.h>
typedef float __attribute__((aligned (16))) float4[4];
inline void doit(const float4 a, const float4 b, float4 c) {
for (size_t ii=0; ii < 4; ii++) {
c[ii] = (a[ii] < b[ii]) ? a[ii] : b[ii];
}
}
int main() {
float4 a = {rand(), rand(), rand(), rand() };
float4 b = {1.0f, 1.0f, 1.0f, 1.0f };
float4 c;
doit((float*)&a, (float*)&b, (float*)&c);
printf("%f\n", c[0]);
}
The vectorizer says:
ssetest.c:7: note: vect_model_load_cost: aligned.
ssetest.c:7: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
ssetest.c:7: note: vect_model_load_cost: aligned.
ssetest.c:7: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
ssetest.c:7: note: not vectorized: relevant stmt not supported: D.3177_22 = iftmp.4_18 < iftmp.4_21;
ssetest.c:12: note: vectorized 0 loops in function.
Edit again: I should note I've been trying this on GCC 4.4.7 (RHEL 6) and GCC 4.6 (Ubuntu), both without luck.
Upvotes: 5
Views: 833
Reputation: 620
fminf involves special requirements on treatment of non-finite operands, which gcc can ignore when -ffinite-math-only is set (as -ffast-math does).
Upvotes: 0
Reputation: 14583
It looks like in GCC, vectorization of reductions isn't enabled unless you specify -ffast-math or -fassociative-math. When I enable those, it vectorizes just fine (using fminf in the inner loop):
ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: Cost model analysis:
Vector inside of loop cost: 4
Vector outside of loop cost: 0
Scalar iteration cost: 4
Scalar outside cost: 0
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1
ssetest.c:9: note: Profitability threshold = 3
ssetest.c:9: note: LOOP VECTORIZED.
ssetest.c:15: note: vectorized 1 loops in function.
Upvotes: 1