C code to auto-vectorize floating point minimum

Question

I've got a little bit of code in my innermost loop that I'm using to clamp some error values for a rasterization algorithm I'm writing:

float cerror[4] = {
    MINF(error[0], 1.0f), 
    MINF(error[1], 1.0f), 
    MINF(error[2], 1.0f), 
    MINF(error[3], 1.0f) 
};

where MINF is just MINF(a,b) = ((a) < (b)) ? (a) : (b)

It turns out I've got 4 error values I have to update in this inner loop, all floats, so it'd be great if I could get them all stored in SSE registers and have the minimum computed with minps rather than separately, but the compiler doesn't seem to be doing that for me.

I even tried moving it to its own function so I can see the vectorizer output:

void fclamp4(float* __restrict__ aa, float* __restrict__ bb) {
    for (size_t ii=0; ii < 4; ii++) {
        aa[ii] = (bb[ii] > 1.0) ? 1.0f : bb[ii];
    }
}

Which gives me something like:

inc/simplex.h:1508: note: not vectorized: unsupported data-type bool
inc/simplex.h:1507: note: vectorized 0 loops in function.

Is there a way to better encourage the compiler to do this for me? I'd rather not skip straight to intrinsics if I can avoid it, so the code remains portable. Is there perhaps a general reference with common patterns?

Lastly, all of my error/cerror/error increments are stored in float[4] arrays on the stack. Do I need to manually align those, or can the compiler handle that for me?

Edit: playing around with an aligned type and still no dice.

#include <stdio.h>
#include <stdlib.h>

typedef float __attribute__((aligned (16))) float4[4];

inline void doit(const float4 a, const float4 b, float4 c) {
    for (size_t ii=0; ii < 4; ii++) {
        c[ii] = (a[ii] < b[ii]) ? a[ii] : b[ii];
    }
}

int main() {
    float4 a = {rand(), rand(), rand(), rand() };
    float4 b = {1.0f,   1.0f,   1.0f,   1.0f   };
    float4 c;

    doit((float*)&a, (float*)&b, (float*)&c);

    printf("%f\n", c[0]);
}

The vectorizer says:

ssetest.c:7: note: vect_model_load_cost: aligned.
ssetest.c:7: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
ssetest.c:7: note: vect_model_load_cost: aligned.
ssetest.c:7: note: vect_model_load_cost: inside_cost = 4, outside_cost = 0 .
ssetest.c:7: note: not vectorized: relevant stmt not supported: D.3177_22 = iftmp.4_18 < iftmp.4_21;

ssetest.c:12: note: vectorized 0 loops in function.

Edit again: I should note I've been trying this on GCC 4.4.7 (RHEL 6) and GCC 4.6 (Ubuntu), both without luck.

gct · Accepted Answer

It looks like GCC doesn't enable vectorization of reductions unless you specify -ffast-math or -fassociative-math. When I enable those, it vectorizes just fine (using fminf in the inner loop):

ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_load_cost: aligned.
ssetest.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
ssetest.c:9: note: Cost model analysis:
Vector inside of loop cost: 4
Vector outside of loop cost: 0
Scalar iteration cost: 4
Scalar outside cost: 0
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 1

ssetest.c:9: note: Profitability threshold = 3

ssetest.c:9: note: LOOP VECTORIZED.
ssetest.c:15: note: vectorized 1 loops in function.
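
For reference, here is a minimal sketch of the kind of loop and build line involved. The function name min4 is hypothetical, and the sketch keeps the conditional form from the question rather than fminf, so it has no libm dependency. Whether this exact form vectorizes on a given GCC version is an assumption to confirm with the vectorizer report.

```c
#include <stddef.h>

/* Assumed build line (GCC 4.x era flags), not taken from the answer:
 *   gcc -std=c99 -O3 -ffast-math -ftree-vectorizer-verbose=2 -c min4.c
 * Newer GCC releases report vectorization with -fopt-info-vec instead. */

/* Element-wise minimum of two 4-float arrays.
 * restrict promises the compiler that a, b and c do not overlap. */
void min4(const float *restrict a, const float *restrict b,
          float *restrict c) {
    for (size_t ii = 0; ii < 4; ii++) {
        /* Same comparison as the MINF macro in the question. */
        c[ii] = (a[ii] < b[ii]) ? a[ii] : b[ii];
    }
}
```

Calling min4(error, ones, cerror) with ones = {1.0f, 1.0f, 1.0f, 1.0f} would clamp the four error values as in the question. Swapping the conditional for fminf(a[ii], b[ii]) (with <math.h> included and -lm at link time) gives the variant I actually tested above.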
