coptimizationvectorizationsimddot-product

Reputation: 11

How can I optimize this "dot product" function using SIMD? It's Mat4x4 * Vec4 but with huge strided access

I'm having a huge issue trying to get the best speedups for this function but I can't write effective SIMD code that beats the auto-vectorizer. I need to write some SIMD to beat it but I am completely stuck right now:

static int  c_coef_4tap[4][4] = {
     0, 64,  0,  0,
    -4, 54, 16, -2,
    -4, 36, 36, -4,
    -2, 16, 54, -4,
};

static void s_read_qpel_ref(int list, int ref_pos_x, int ref_pos_y, int blk_size, int ref_buf[])
{
    int  int_buf[35 * 35];                              // 1+32+2 = 35
    int  int_x, int_y, frac_x, frac_y;
    int  pixel[4][4], hor[4], ver;
    int  x, y, h, v;

    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

   //int_buf fill-up function is here. it's too long but is not the bottleneck

  int  int_buf[35 * 35] = {139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 
    139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 
    150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 
    149, 150, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 
    150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 
    134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 
    140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 
    146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 
    130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 130, 131, 132, 133, 134, 136, 137, 138, 139, 140, 141, 142, 143, 145, 146, 147, 148, 149, 150, 123, 124, 125, 126, 127, 
    128, 129, 130, 117, 118, 119, 120, 121, 122, 123, 124, 123, 124, 125, 126, 127, 128, 129, 130, 117, 118, 119, 120, 121, 122, 123, 124, 123, 124, 125, 126, 127, 128, 129, 130, 117, 118, 119, 120, 121, 122, 123, 124, 
    123, 124, 125, 126, 127, 128, 129, 130, 117, 118, 119, 120, 121, 122, 123, 124, 123, 124, 125, 126, 127, 128, 129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 129, 130, 115, 116, 117,
     118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 
     129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124,
      124, 125, 126, 127, 128, 129, 130, 115, 116, 117, 118, 119, 120, 121, 122, 124, 124, 125, 126, 127, 128, 129, 130, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94,
       94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 94, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135,
        135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135,
         135, 135, 135, 135, 135, 135, 135, 135, 135, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1450347658, -8299368, 1450464598, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1453931904, 
         -8199900, 1451012217, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1448624268, 0, 
         0, 4, 16, 1452089120, 79, -8299912, 1449255164, 81299392, 4, 0, 4, -8299780, 0, 0, 1449255134, 0, 1452089120};   


    for (y=0; y<blk_size; y++) {
        for (x=0; x<blk_size; x++) {
            for (v=0; v<4; v++) {
                for (h=0; h<4; h++) {
                    pixel[v][h] = int_buf[(y + v) * (blk_size+3) + (x + h)];
                }
            }
            // 4x4 filtering
            for (v=0; v<4; v++) {
                hor[v] = (pixel[v][0] * c_coef_4tap[frac_x][0] +
                          pixel[v][1] * c_coef_4tap[frac_x][1] +
                          pixel[v][2] * c_coef_4tap[frac_x][2] +
                          pixel[v][3] * c_coef_4tap[frac_x][3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
            }
            ver = (hor[0] * c_coef_4tap[frac_y][0] +
                   hor[1] * c_coef_4tap[frac_y][1] +
                   hor[2] * c_coef_4tap[frac_y][2] +
                   hor[3] * c_coef_4tap[frac_y][3] + 32) >> 6;
            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }
}

The current optimization flags I am using are: -O3 -march=native -funroll-loops -flto

These are giving me the best speedups no problem. But I have to make it even faster.

Currently, I have profiled the code and figured out that 'hor[v]' calculations take the most time. I have tried horizontal add and it makes it way slower. I have also looked at the auto-vectorizer dumps and it shows that the x loop is being vectorized. My SIMD intrinsic code has not been vectorizing the x loop at all. In fact, when I introduce intrinsics, it stops vectorizing the x-loop. Can someone please tell me what to do here?

I would try transposing and using dot products, but then the x-loop again is not being vectorized. the strided access of int_buf[] to fill up the pixel array is causing huge issues as well I think.

Also even though these are ints. the pixel array does not go beyond 10 bit integers. Thinking if using 16 bit multiplication would help.

Upvotes: 1

Answers (2)

Craig Estey

Reputation: 33631

I've completely revamped the code I posted in my prior answer: https://stackoverflow.com/a/78411530/5382650

It used openmp to speed things up. But, OP wanted to use x86 simd instructions.

I've removed the openmp code and added simd code instead.

The new code would not fit appended to the prior answer. My prior answer has a lot of background/discussion, and (more importantly) benchmark data, so I'd like to keep it.

Below is the new code and revised benchmark results.

FILE: dbgtype.h

// dbglib/dbgtype.h -- debug/core library

#ifndef _dbglib_dbgtype_h_
#define _dbglib_dbgtype_h_

#define _GNU_SOURCE

typedef unsigned int u32;

#define MTYP(_typ,_pre) \
    typedef _typ _pre##_t; \
    typedef _pre##_t *_pre##_p; \
    typedef const _pre##_t *_pre##_pc

#define MFWD(_typ) \
    MTYP(struct _typ,_typ)

#define UFWD(_typ) \
    MTYP(union _typ,_typ)

#define MPACKED     __attribute__((__packed__))

#if 0
#define ALIGNAS(_align) \
    _Alignas(_align)
#else
#define ALIGNAS(_align) \
    __attribute__((__aligned__(_align)))
#endif

#define CAST(_typ,_sym) \
    ((_typ) (_sym))

#define UCMP(_lhs,_rhs) \
    ((_lhs > _rhs) - (_lhs < _rhs))

#define countof(_arr) \
    (sizeof(_arr) / sizeof(_arr[0]))

#endif

FILE: mtxsmd.h

// mtxlib/mtxsmd.h -- simd matrix library
//
// pxor
//   __m128i _mm_setzero_si128
// paddd
//   __m128i _mm_add_epi32
// movdqu
//   __m128i _mm_loadu_si128
//   __m128i _mm_storeu_si128
// movdqa
//   __m128i _mm_load_si128
//   __m128i _mm_store_si128
// movdqa
//   __m128i _mm_store_si128
// pmulld
//   __m128i _mm_mullo_epi32
// pextrd
//   int __mm_extract_epi32

#ifndef _mtxsmd_mtxsmd_h_
#define _mtxsmd_mtxsmd_h_

#include <x86intrin.h>
#if 0
#include <mmintrin.h>
#include <xmmintrin.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <tmmintrin.h>
#include <ammintrin.h>
#include <smmintrin.h>
#include <wmmintrin.h>
#endif

// use aligned SIMD instructions
// mtxgeo_t must be aligned
#ifndef MTXSMD_A
#define MTXSMD_A            1
#endif

// data type (e.g. int / s64)
#ifndef MTXDAT
#define MTXDAT              int
#endif
MTYP(MTXDAT,mtxdat);

MTYP(__m128i,mtxsmd);

#ifdef __clang__
typedef __m128i *mtxsmdu_p;
#else
typedef __m128i_u *mtxsmdu_p;
#endif

// cheap conversion trick (vs store)
UFWD(mtxsmdcvt);
union mtxsmdcvt {
    mtxsmd_t cvt_smd;
    mtxdat_t cvt_dat[4];
};

#define _MTXSMDINC(_typ) \
    (sizeof(mtxsmd_t) / sizeof(_typ))
#define MTXSMDINC \
    _MTSMDINC(mtxdat_t)

#if MTXSMD_A
#define MTXGEO_ALIGNAS      ALIGNAS(sizeof(mtxsmd_t))
#else
#define MTXGEO_ALIGNAS      /**/
#endif

#define MTXSMDZERO \
    _mm_setzero_si128()

// mtxsmdlda -- load aligned
inline_always mtxsmd_t
mtxsmdlda(const void *vp)
{
    return _mm_load_si128(CAST(mtxsmd_p,vp));
}

// mtxsmdldu -- load unaligned
inline_always mtxsmd_t
mtxsmdldu(const void *vp)
{
    return _mm_loadu_si128(CAST(mtxsmdu_p,vp));
}

// mtxsmdmulv -- multiply vector
#define _mtxsmdmulv(_vdst,_vsrc) \
    _mm_mullo_epi32(_vdst,_vsrc)
inline_always mtxsmd_t
mtxsmdmulv(mtxsmd_t vdst,mtxsmd_t vsrc)
{
    return _mtxsmdmulv(vdst,vsrc);
}
inline_always mtxsmd_t
mtxsmdmulp(mtxsmd_t vdst,mtxsmd_p vsrc)
{
    return _mtxsmdmulv(vdst,*vsrc);
}

// mtxsmdaddv -- add vector
#define _mtxsmdaddv(_vdst,_vsrc) \
    _mm_add_epi32(_vdst,_vsrc)
inline_always mtxsmd_t
mtxsmdaddv(mtxsmd_t vdst,mtxsmd_t vsrc)
{
    return _mtxsmdaddv(vdst,vsrc);
}
inline_always mtxsmd_t
mtxsmdaddp(mtxsmd_t vdst,mtxsmd_p vsrc)
{
    return _mtxsmdmulv(vdst,*vsrc);
}

// mtxsmdsumh -- horizontal sum of xmm vector
inline_always mtxdat_t
mtxsmdsumh(mtxsmd_t sumv)
{
    mtxdat_t sum = 0;

    sum += _mm_extract_epi32(sumv,0);
    sum += _mm_extract_epi32(sumv,1);
    sum += _mm_extract_epi32(sumv,2);
    sum += _mm_extract_epi32(sumv,3);

    return sum;
}

#endif

FILE: simd.c

// simd.c -- benchmark program

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <malloc.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define inline_always static inline __attribute__((always_inline))

#define MTXDAT  int
#include <dbgtype.h>
#include <mtxsmd.h>

#ifdef HAVE_BIGMAT_H
#include <bigmat.h>
#else
const int int_dft[35 * 35] = { 1, 2, 3 };
#undef INT_LINEAR
#define INT_LINEAR  1
#endif

#undef sysfault
#define sysfault(_fmt...) \
    do { \
        printf(_fmt); \
        exit(1); \
    } while (0)

typedef long long tsc_t;
#define TSCSEC      1000000000

tsc_t
tscget(void)
{
    struct timespec ts;
    tsc_t tsc;

    clock_gettime(CLOCK_MONOTONIC,&ts);
    tsc = ts.tv_sec;
    tsc *= TSCSEC;
    tsc += ts.tv_nsec;

    return tsc;
}

double
tscsec(tsc_t tsc)
{
    double sec = tsc;
    sec /= TSCSEC;
    return sec;
}

tsc_t t_beg;
tsc_t t_end;
tsc_t t_dif;

#define BEG \
    t_beg = tscget()

#define END \
    do { \
        t_end = tscget(); \
        t_dif = t_end - t_beg; \
    } while (0)

#if 0
static int c_coef_4tap[4][4] = {
    0, 64, 0, 0,
    -4, 54, 16, -2,
    -4, 36, 36, -4,
    -2, 16, 54, -4,
};
#else
const int ALIGNAS(64) c_coef_4tap[4][4] = {
    { 0, 64, 0, 0 },
    { -4, 54, 16, -2 },
    { -4, 36, 36, -4 },
    { -2, 16, 54, -4 },
};
#endif

typedef void
(*fncptr)(int list, int ref_pos_x, int ref_pos_y, int blk_size, int ref_buf[]);

typedef mtxsmd_pc
(*fncvec)(int goflg,int blk_size,const int *int_buf);

MFWD(fncdef);
struct fncdef {
    const char *fnc_sym;
    const char *fnc_reason;
    u32 fnc_opt;

    union {
        fncptr fnc_ptr;
        fncvec fnc_vec;
    };

    int *fnc_ref;
    tsc_t fnc_elap;
};

enum {
    FNCTYP_NORMAL,
    FNCTYP_PIXVEC,
};

enum {
    FNCOPT_BASELINE = (1u << 8),
    FNCOPT_ALLOC = (1u << 9),
};

#define FNCTYPE(_fnc) \
    (_fnc->fnc_opt & 0xFF)

size_t ref_siz;
int opt_s;

inline_always int
I_CLIP3(int lo,int hi,int val)
{

    do {
        if (val < lo) {
            val = lo;
            break;
        }

        if (val > hi) {
            val = hi;
            break;
        }
    } while (0);

    return val;
}

void
int_fill(int *int_buf)
{

#if INT_LINEAR
    for (int idx = 0;  idx < (35 * 35);  ++idx)
        int_buf[idx] = idx;
#else
    for (int idx = 0;  idx < (35 * 35);  ++idx)
        int_buf[idx] = int_dft[idx];
#endif
}

void
showfnc(fncdef_pc fnc)
{

    printf("ELAPSED: %.9f %s (%s)\n",
        tscsec(fnc->fnc_elap),fnc->fnc_sym,fnc->fnc_reason);
}

mtxsmd_pc
pixel_vector(int goflg,int blk_size,const int *int_buf)
{
    static int save_blksize;
    static const int *save_buf;
    static mtxdat_t save_pixel[35 * 35 * 4 * 4] ALIGNAS(64);

    do {
        if (! goflg) {
            if ((blk_size == save_blksize) && (int_buf == save_buf))
                break;
        }

        int stride = blk_size + 3;

        BEG;

        int *dst = save_pixel;
        for (int y = 0; y < blk_size; y++) {
            for (int x = 0; x < blk_size; x++) {
                for (int v = 0; v < 4; v++) {
                    for (int h = 0; h < 4; h++, ++dst)
                        *dst = int_buf[(y + v) * stride + (x + h)];
                }
            }
        }

        END;

        save_blksize = blk_size;
        save_buf = int_buf;
    } while (0);

    return (mtxsmd_p) save_pixel;
}

mtxsmd_pc
pixel_vecsmd(int goflg,int blk_size,const int *int_buf)
{
    static int save_blksize;
    static const int *save_buf;
    static mtxdat_t save_pixel[35 * 35 * 4 * 4] ALIGNAS(64);

    do {
        if (! goflg) {
            if ((blk_size == save_blksize) && (int_buf == save_buf))
                break;
        }

        int stride = blk_size + 3;

        BEG;

        mtxsmd_p dst = (mtxsmd_p) save_pixel;
        for (int y = 0; y < blk_size; y++) {
            for (int x = 0; x < blk_size; x++) {
                for (int v = 0; v < 4; v++, ++dst) {
                    mtxsmd_pc src = CAST(mtxsmd_pc,
                        &int_buf[(y + v) * stride + x]);
                    *dst = mtxsmdldu(src);
                }
            }
        }

        END;

        save_blksize = blk_size;
        save_buf = int_buf;
    } while (0);

    return (mtxsmd_p) save_pixel;
}

void
s_orig(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
#if 0
    int int_x, int_y;
#endif
    int frac_x, frac_y;
    int pixel[4][4];
    int hor[4];
    int ver;
    int x, y, h, v;

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
#if INT_LINEAR
    int int_buf[35 * 35];               // 1+32+2 = 35
    int_fill(int_buf);
#else
    const int *int_buf = int_dft;
#endif

    BEG;

    for (y = 0; y < blk_size; y++) {
        for (x = 0; x < blk_size; x++) {
            for (v = 0; v < 4; v++) {
                for (h = 0; h < 4; h++) {
                    pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                }
            }

            // 4x4 filtering
            for (v = 0; v < 4; v++) {
                hor[v] = (pixel[v][0] * c_coef_4tap[frac_x][0] +
                    pixel[v][1] * c_coef_4tap[frac_x][1] +
                    pixel[v][2] * c_coef_4tap[frac_x][2] +
                    pixel[v][3] * c_coef_4tap[frac_x][3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
            }

            ver = (hor[0] * c_coef_4tap[frac_y][0] +
                hor[1] * c_coef_4tap[frac_y][1] +
                hor[2] * c_coef_4tap[frac_y][2] +
                hor[3] * c_coef_4tap[frac_y][3] + 32) >> 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

void
s_simd_1(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int pixel[4][4] ALIGNAS(sizeof(mtxsmd_t));
#if 0
    int hor[4];
    int ver;
#endif

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    int frac_x = ref_pos_x & 3;
    int frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
#if INT_LINEAR
    int int_buf[35 * 35];               // 1+32+2 = 35
    int_fill(int_buf);
#else
    const int *int_buf = int_dft;
#endif

#if 0
    // pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
    mtxsmd_pc pixptr = pixel_vector(0,blk_size,int_buf);
#endif

    BEG;

    mtxsmd_t coef_x = mtxsmdlda(&c_coef_4tap[frac_x]);
    mtxsmd_t coef_y = mtxsmdlda(&c_coef_4tap[frac_y]);

    for (int y = 0; y < blk_size; y++) {
        for (int x = 0; x < blk_size; x++) {
            for (int v = 0; v < 4; v++) {
                for (int h = 0; h < 4; h++)
                    pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
            }
#if 0
            // 4x4 filtering
                hor[v] = (pixel[0] * coef_x[0] +
                    pixel[1] * coef_x[1] +
                    pixel[2] * coef_x[2] +
                    pixel[3] * coef_x[3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
#endif

            mtxsmdcvt_t cvt;
            for (int v = 0; v < 4; v++) {
                mtxsmd_t hor_v = mtxsmdlda(pixel[v]);
                hor_v = mtxsmdmulv(hor_v,coef_x);

                mtxdat_t hsum = mtxsmdsumh(hor_v);
                hsum += 32;
                hsum >>= 6;
                hsum = I_CLIP3(0, 255, hsum);

                cvt.cvt_dat[v] = hsum;
            }

#if 0
            ver = (hor[0] * coef_y][0] +
                hor[1] * coef_y[1] +
                hor[2] * coef_y[2] +
                hor[3] * coef_y[3] + 32) >> 6;
            ver = I_CLIP3(0, 255, ver);
#endif

            mtxsmd_t acc = mtxsmdmulv(cvt.cvt_smd,coef_y);

            mtxdat_t ver = mtxsmdsumh(acc);
            ver += 32;
            ver >>= 6;
            ver = I_CLIP3(0, 255, ver);

            ref_buf[y * blk_size + x] = ver;
        }
    }

    END;
}

void
s_simd_2(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
#if 0
    int pixel[4][4] ALIGNAS(sizeof(mtxsmd_t));
#else
    mtxsmd_t pixel[4] ALIGNAS(sizeof(mtxsmd_t));
#endif
#if 0
    int hor[4];
    int ver;
#endif

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    int frac_x = ref_pos_x & 3;
    int frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
#if INT_LINEAR
    int int_buf[35 * 35];               // 1+32+2 = 35
    int_fill(int_buf);
#else
    const int *int_buf = int_dft;
#endif

#if 0
    // pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
    mtxsmd_pc pixptr = pixel_vector(0,blk_size,int_buf);
#endif

    BEG;

    mtxsmd_t coef_x = mtxsmdlda(&c_coef_4tap[frac_x]);
    mtxsmd_t coef_y = mtxsmdlda(&c_coef_4tap[frac_y]);

    for (int y = 0; y < blk_size; y++) {
        for (int x = 0; x < blk_size; x++) {
            for (int v = 0; v < 4; v++) {
#if 0
                for (int h = 0; h < 4; h++)
                    pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
#else
                mtxsmd_pc src = CAST(mtxsmd_pc,
                    &int_buf[(y + v) * (blk_size + 3) + x]);
                pixel[v] = mtxsmdldu(src);
#endif
            }
#if 0
            // 4x4 filtering
                hor[v] = (pixel[0] * coef_x[0] +
                    pixel[1] * coef_x[1] +
                    pixel[2] * coef_x[2] +
                    pixel[3] * coef_x[3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
#endif

            mtxsmdcvt_t cvt;
            for (int v = 0; v < 4; v++) {
                mtxsmd_t hor_v = mtxsmdlda(&pixel[v]);
                hor_v = mtxsmdmulv(hor_v,coef_x);

                mtxdat_t hsum = mtxsmdsumh(hor_v);
                hsum += 32;
                hsum >>= 6;
                hsum = I_CLIP3(0, 255, hsum);

                cvt.cvt_dat[v] = hsum;
            }

#if 0
            ver = (hor[0] * coef_y][0] +
                hor[1] * coef_y[1] +
                hor[2] * coef_y[2] +
                hor[3] * coef_y[3] + 32) >> 6;
            ver = I_CLIP3(0, 255, ver);
#endif

            mtxsmd_t acc = mtxsmdmulv(cvt.cvt_smd,coef_y);

            mtxdat_t ver = mtxsmdsumh(acc);
            ver += 32;
            ver >>= 6;
            ver = I_CLIP3(0, 255, ver);

            ref_buf[y * blk_size + x] = ver;
        }
    }

    END;
}

void
s_simd_3(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
#if 0
    //int pixel[4];
    //int hor[4];
    //int ver;
#endif

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    int frac_x = ref_pos_x & 3;
    int frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
#if INT_LINEAR
    int int_buf[35 * 35];               // 1+32+2 = 35
    int_fill(int_buf);
#else
    const int *int_buf = int_dft;
#endif

    // pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
    mtxsmd_pc pixptr = pixel_vector(0,blk_size,int_buf);

    BEG;

    mtxsmd_t coef_x = mtxsmdlda(&c_coef_4tap[frac_x]);
    mtxsmd_t coef_y = mtxsmdlda(&c_coef_4tap[frac_y]);

    for (int y = 0; y < blk_size; y++) {
        for (int x = 0; x < blk_size; x++) {
#if 0
            // 4x4 filtering
                hor[v] = (pixel[0] * coef_x[0] +
                    pixel[1] * coef_x[1] +
                    pixel[2] * coef_x[2] +
                    pixel[3] * coef_x[3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
#endif

            mtxsmdcvt_t cvt;
            for (int v = 0; v < 4; v++, pixptr += 1) {
                mtxsmd_t hor_v = mtxsmdlda(pixptr);
                hor_v = mtxsmdmulv(hor_v,coef_x);

                mtxdat_t hsum = mtxsmdsumh(hor_v);
                hsum += 32;
                hsum >>= 6;
                hsum = I_CLIP3(0, 255, hsum);

                cvt.cvt_dat[v] = hsum;
            }

#if 0
            ver = (hor[0] * coef_y][0] +
                hor[1] * coef_y[1] +
                hor[2] * coef_y[2] +
                hor[3] * coef_y[3] + 32) >> 6;
            ver = I_CLIP3(0, 255, ver);
#endif

            mtxsmd_t acc = mtxsmdmulv(cvt.cvt_smd,coef_y);

            mtxdat_t ver = mtxsmdsumh(acc);
            ver += 32;
            ver >>= 6;
            ver = I_CLIP3(0, 255, ver);

            ref_buf[y * blk_size + x] = ver;
        }
    }

    END;
}

void
s_fix1(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int int_buf[35 * 35];               // 1+32+2 = 35

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    int frac_x = ref_pos_x & 3;
    int frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
    int_fill(int_buf);

    const int *coef_x = c_coef_4tap[frac_x];
    const int *coef_y = c_coef_4tap[frac_y];

    BEG;

    for (int y = 0; y < blk_size; y++) {
        for (int x = 0; x < blk_size; x++) {
            int ver = 0;

            for (int v = 0; v < 4; v++) {
                int hor_h = 0;

                for (int h = 0; h < 4; h++) {
                    int pixel = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                    hor_h += pixel * coef_x[h];
                }

                hor_h += 32;
                hor_h >>= 6;
                hor_h = I_CLIP3(0, 255, hor_h);

                ver += hor_h * coef_y[v];
            }

            ver += 32;
            ver >>= 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

#define FNCDEF(_sym,_reason) \
    .fnc_ptr = (void *) _sym, .fnc_sym = #_sym, .fnc_reason = _reason

fncdef_t fnclist[] = {
    { FNCDEF(pixel_vector,"pixel vector caching"),
        .fnc_opt = FNCTYP_PIXVEC | FNCOPT_BASELINE },
    { FNCDEF(pixel_vecsmd,"pixel vector caching (simd)"),
        .fnc_opt = FNCTYP_PIXVEC },
    { FNCDEF(s_orig,"OP's original code"),
        .fnc_opt = FNCOPT_BASELINE },
    { FNCDEF(s_fix1,"Craig's cleanup") },
    { FNCDEF(s_simd_1,"Craig's simd (pixel 2D)") },
    { FNCDEF(s_simd_2,"Craig's simd (pixel 2D wide int_buf loading)") },
    { FNCDEF(s_simd_3,"Craig's simd (pixel vector)") },
};

#define FNCALL(_fnc) \
    _fnc = &fnclist[0];  _fnc < &fnclist[countof(fnclist)];  ++_fnc

#define FNCREV(_ref,_fnc) \
    _ref = _fnc - 1;  _ref >= &fnclist[0];  --_ref

void
docmp(fncdef_pc ref,fncdef_pc fnc,int blk_size)
{
    const int *ref_orig = ref->fnc_ref;
    const int *ref_fast = fnc->fnc_ref;
    int err = 0;

    switch (FNCTYPE(fnc)) {
    case FNCTYP_PIXVEC:
        blk_size += 3;
        break;
    }

    for (int y = 0;  y < blk_size;  ++y) {
        for (int x = 0;  x < blk_size;  ++x) {
            int orig = ref_orig[(y * blk_size) + x];
            int fast = ref_fast[(y * blk_size) + x];
            if (fast != orig) {
                printf("MISMATCH [%3d][%3d] %4d %4d (%s)\n",
                    y,x,orig,fast,fnc->fnc_sym);
                err += 1;
            }
        }
    }
}

void
dofnc(fncdef_p fnc,
    int list, int ref_pos_x, int ref_pos_y, int blk_size)
{

    fnc->fnc_elap = 1LL << 62;

    switch (FNCTYPE(fnc)) {
    case FNCTYP_NORMAL:
        fnc->fnc_ref = memalign(64,ref_siz);
        fnc->fnc_opt |= FNCOPT_ALLOC;
        break;
    }

    for (int iter = 1;  iter <= 5;  ++iter) {
        switch (FNCTYPE(fnc)) {
        case FNCTYP_PIXVEC:
            fnc->fnc_ref = (int *) fnc->fnc_vec(1,blk_size,int_dft);
            break;
        case FNCTYP_NORMAL:
            fnc->fnc_ptr(list,ref_pos_x,ref_pos_y,blk_size,fnc->fnc_ref);
            break;
        }

        if (t_dif < fnc->fnc_elap)
            fnc->fnc_elap = t_dif;
    }

    // compare against reference
    fncdef_p ref;
    for (FNCREV(ref,fnc)) {
        if ((ref->fnc_opt & FNCOPT_BASELINE) == 0)
            continue;
        if (FNCTYPE(ref) == FNCTYPE(fnc))
            docmp(ref,fnc,blk_size);
    }
}

int
cmpinc(const void *vlhs,const void *vrhs)
{
    fncdef_p *flhs = (void *) vlhs;
    fncdef_p *frhs = (void *) vrhs;
    int cmp = UCMP((*flhs)->fnc_elap,(*frhs)->fnc_elap);
    return cmp;
}

void
dotest(int list,int ref_pos_x,int ref_pos_y,int blk_size)
{
    fncdef_p fnc;
    fncdef_p *ind;
    fncdef_p indlist[countof(fnclist)];

    printf("dotest: ref_pos_x=%d ref_pos_y=%d blk_size=%d\n",
        ref_pos_x,ref_pos_y,blk_size);

    ref_siz = sizeof(int) * blk_size * blk_size;

    for (FNCALL(fnc)) {
        dofnc(fnc,list,ref_pos_x,ref_pos_y,blk_size);
        indlist[fnc - fnclist] = fnc;
    }

    if (opt_s)
        qsort(indlist,countof(indlist),sizeof(fncdef_p),cmpinc);

    for (ind = &indlist[0];  ind < &indlist[countof(indlist)];  ++ind)
        showfnc(*ind);

    for (FNCALL(fnc)) {
        if (fnc->fnc_opt & FNCOPT_ALLOC)
            free(fnc->fnc_ref);
        fnc->fnc_ref = NULL;
    }
}

int
main(int argc,char **argv)
{

    setlinebuf(stdout);
    setlinebuf(stderr);

    --argc;
    ++argv;

    for (;  argc > 0;  --argc, ++argv) {
        char *cp = *argv;
        if (*cp != '-')
            break;

        cp += 2;
        switch(cp[-1]) {
        case 's':  // do sort
            opt_s = ! opt_s;
            break;
        }
    }

    //dotest(0,35,0,16);
    dotest(0,35,0,32);

    return 0;
}

FILE: bldhor2

#!/usr/bin/perl
# bldhor2 -- build and run benchmark program

master(@ARGV);
exit(0);

sub master
{
    my(@argv) = @_;

    $sym_lookup{"D"} = 1;
    $sym_lookup{"gdb"} = 0;
    $sym_lookup{"s"} = 0;
    $sym_lookup{"S"} = 0;
    $sym_lookup{"f"} = 1;
    @syms = reverse(sort(keys(%sym_lookup)));

    while (@argv > 0) {
        $opt = $argv[0];
        last unless ($opt =~ s/^-//);
        shift(@argv);

        undef($sym);
        foreach $rhs (@syms) {
            if ($opt =~ /^($rhs)=?(.*)$/) {
                ($sym,$val) = ($1,$2);
                last;
            }
        }
        sysfault("master: unknown option -- '%s'\n",$opt)
            unless (defined($sym));

        $aryflg = $sym_lookup{$sym};
        $sym = "opt_" . $sym;

        $val = 1
            if ($val eq "");

        if ($aryflg) {
            push(@$sym,$val);
        }
        else {
            $$sym = $val;
        }
    }

    $src //= "simd";

    $opt_cc //= "cc";

    if (-e "bigmat.h") {
        push(@opt_D,"HAVE_BIGMAT_H=1");
    }

    $objdir = "OBJ";
    system("rm","-fr",$objdir);
    mkdir($objdir);

    foreach $opt (@opt_D) {
        $opt = "-D$opt";
    }

    {
        @W = qw(-Wall -Werror -Wno-unknown-pragmas);

        clg("x0",0);
        last if ($opt_gdb);

        clg("x2",2);
        clg("x3",3);

        if (0) {
            clg("omp",3,@W,@opt_D);
            ###clg(qw(simd -O3 -Wall -Werror -fopenmp -fopenmp-simd));
            ###clg(qw(fork -O3 -Wall -Werror -fopenmp -DDOFORK));
        }
    }
}

sub clg
{
    my($root,$Olvl) = @_;
    my(@cflags);

    printf("\n")
        if ($sep);
    $sep = 1;

    shift(@_);
    shift(@_);

    printf("\n");
    printf("%s\n","-" x 80);
    printf("clg: %s ...\n",$root);

    push(@cflags,@W,@opt_D,@opt_f);
    if ($opt_S) {
        push(@cflags,"-S");
        push(@cflags,"-fverbose-asm");
    }
    push(@cflags,"-I.");
    ###push(@cflags,"-I/home/cae/preserve/ovrbnc");
    push(@cflags,"-march=native");
    ###push(@cflags,"-fsanitize=address");

    push(@cflags,"-O$Olvl")
        unless ($opt_gdb);

    $xfile = "$objdir/$root";

    ###$root = $src;
    do_system($opt_cc,"-o",$xfile,"$src.c","-g",@cflags,@_);
    exit($code) if ($code);

    {
        last if ($opt_S);

        my(@cmd);
        push(@cmd,$xfile);
        push(@cmd,"-s")
            if ($opt_s);

        unshift(@cmd,"qgdb")
            if ($opt_gdb);

        do_system(@cmd);
    }
}

sub do_system
{
    my($cmd);
    my($signo,$status,$code);

    $cmd = join(" ",@_);
    printf("%s\n",$cmd);
    system($cmd);

    $status = $?;

    {
        if ($status & 0x80) {
            $signo = $status & 0x7F;
            printf("do_system: signal %d\n",$signo);
        }
        else {
            $code = $status >> 8;
            printf("do_system: code %d\n",$code)
                if ($code);
        }
    }

    exit(1) if ($code or $signo);
}

sub sysfault
{

    printf(STDERR @_);
    exit(1);
}

Here is the output of the new program:

--------------------------------------------------------------------------------
clg: x0 ...
cc -o OBJ/x0 simd.c -g -Wall -Werror -Wno-unknown-pragmas -DINT_LINEAR -DHAVE_BIGMAT_H=1 -I. -march=native -O0
OBJ/x0 -s
dotest: ref_pos_x=35 ref_pos_y=0 blk_size=32
ELAPSED: 0.000029549 pixel_vecsmd (pixel vector caching (simd))
ELAPSED: 0.000075990 pixel_vector (pixel vector caching)
ELAPSED: 0.000118173 s_simd_3 (Craig's simd (pixel vector))
ELAPSED: 0.000137634 s_fix1 (Craig's cleanup)
ELAPSED: 0.000139114 s_orig (OP's original code)
ELAPSED: 0.000146076 s_simd_2 (Craig's simd (pixel 2D wide int_buf loading))
ELAPSED: 0.000193606 s_simd_1 (Craig's simd (pixel 2D))

--------------------------------------------------------------------------------
clg: x2 ...
cc -o OBJ/x2 simd.c -g -Wall -Werror -Wno-unknown-pragmas -DINT_LINEAR -DHAVE_BIGMAT_H=1 -I. -march=native -O2
OBJ/x2 -s
dotest: ref_pos_x=35 ref_pos_y=0 blk_size=32
ELAPSED: 0.000003430 pixel_vecsmd (pixel vector caching (simd))
ELAPSED: 0.000020585 pixel_vector (pixel vector caching)
ELAPSED: 0.000024393 s_fix1 (Craig's cleanup)
ELAPSED: 0.000025228 s_simd_3 (Craig's simd (pixel vector))
ELAPSED: 0.000027066 s_simd_2 (Craig's simd (pixel 2D wide int_buf loading))
ELAPSED: 0.000028777 s_orig (OP's original code)
ELAPSED: 0.000041119 s_simd_1 (Craig's simd (pixel 2D))

--------------------------------------------------------------------------------
clg: x3 ...
cc -o OBJ/x3 simd.c -g -Wall -Werror -Wno-unknown-pragmas -DINT_LINEAR -DHAVE_BIGMAT_H=1 -I. -march=native -O3
OBJ/x3 -s
dotest: ref_pos_x=35 ref_pos_y=0 blk_size=32
ELAPSED: 0.000002626 pixel_vecsmd (pixel vector caching (simd))
ELAPSED: 0.000004767 s_orig (OP's original code)
ELAPSED: 0.000005040 s_fix1 (Craig's cleanup)
ELAPSED: 0.000006426 pixel_vector (pixel vector caching)
ELAPSED: 0.000018252 s_simd_3 (Craig's simd (pixel vector))
ELAPSED: 0.000018959 s_simd_2 (Craig's simd (pixel 2D wide int_buf loading))
ELAPSED: 0.000019993 s_simd_1 (Craig's simd (pixel 2D))

Note:

There may be a better way to horizontally sum a SIMD number.

This is the mtxsmdsumh function in the code above. It uses 4 _mm_extract_epi32 intrinsic calls (i.e. the pextrd instruction).
I have up to sse4.2 on my machine but nothing later (i.e. no vex or avx).

I scoured the documentation [until my eyes glazed over :-)] but couldn't find anything faster/better. Some things I looked at (and rejected):

_mm_maddubs_epi16 (pmaddubsw)
_mm_madd_epi16 (pmaddwd)
_mm_hadd_epi32 (phaddd)

Upvotes: 1

Craig Estey

Reputation: 33631

Prefaced by the top comments ...

Okay, I've synthesized a benchmark program.

One reason to change pixel and hor to scalars is that it made the conversion to openmp easier.

The first naive attempt to use -fopenmp on original code resulted in incorrect results. My "cleaned up" function was easier to convert to OMP.

With some simple refactoring, I was able to get openmp to work on the original code.

Detailed benchmarks are below, but ...

OMP results in a 2x speedup.
With -O2, my version was somewhat faster (17 %).
With -O3, the versions were nearly the same.

Here is the refactored code:

// bench.c -- benchmark program

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define inline_always static inline __attribute__((always_inline))

double
tscgetf(void)
{
    struct timespec ts;
    double sec;

    clock_gettime(CLOCK_MONOTONIC,&ts);
    sec = ts.tv_nsec;
    sec /= 1e9;
    sec += ts.tv_sec;

    return sec;
}

double t_beg;
double t_end;

#define BEG \
    t_beg = tscgetf()

#define END \
    t_end = tscgetf()

#if 0
static int c_coef_4tap[4][4] = {
    0, 64, 0, 0,
    -4, 54, 16, -2,
    -4, 36, 36, -4,
    -2, 16, 54, -4,
};
#else
const int c_coef_4tap[4][4] = {
    { 0, 64, 0, 0 },
    { -4, 54, 16, -2 },
    { -4, 36, 36, -4 },
    { -2, 16, 54, -4 },
};
#endif

typedef void
(*fncptr)(int list, int ref_pos_x, int ref_pos_y, int blk_size, int ref_buf[]);

struct fncdef {
    fncptr fnc_ptr;
    const char *fnc_sym;
    const char *fnc_reason;
    int *fnc_ref;
    double fnc_elap;
};

inline_always int
I_CLIP3(int lo,int hi,int val)
{

    do {
        if (val < lo) {
            val = lo;
            break;
        }

        if (val > hi) {
            val = hi;
            break;
        }
    } while (0);

    return val;
}

void
int_fill(int *int_buf)
{

    for (int idx = 0;  idx < (35 * 35);  ++idx)
        int_buf[idx] = idx;
}

void
s_orig(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int int_buf[35 * 35];               // 1+32+2 = 35
#if 0
    int int_x, int_y;
#endif
    int frac_x, frac_y;
    int pixel[4][4];
    int hor[4];
    int ver;
    int x, y, h, v;

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
    int_fill(int_buf);

    BEG;

    for (y = 0; y < blk_size; y++) {
        for (x = 0; x < blk_size; x++) {
            for (v = 0; v < 4; v++) {
                for (h = 0; h < 4; h++) {
                    pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                }
            }

            // 4x4 filtering
            for (v = 0; v < 4; v++) {
                hor[v] = (pixel[v][0] * c_coef_4tap[frac_x][0] +
                    pixel[v][1] * c_coef_4tap[frac_x][1] +
                    pixel[v][2] * c_coef_4tap[frac_x][2] +
                    pixel[v][3] * c_coef_4tap[frac_x][3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
            }

            ver = (hor[0] * c_coef_4tap[frac_y][0] +
                hor[1] * c_coef_4tap[frac_y][1] +
                hor[2] * c_coef_4tap[frac_y][2] +
                hor[3] * c_coef_4tap[frac_y][3] + 32) >> 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

void
s_orig_omp1(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int int_buf[35 * 35];               // 1+32+2 = 35
#if 0
    int int_x, int_y;
#endif
    int frac_x, frac_y;
    int pixel[4][4];
    int hor[4];
    int ver;
    int x, y, h, v;

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
    int_fill(int_buf);

    BEG;

#pragma omp parallel for
    for (y = 0; y < blk_size; y++) {
        for (x = 0; x < blk_size; x++) {
            for (v = 0; v < 4; v++) {
                for (h = 0; h < 4; h++) {
                    pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                }
            }

            // 4x4 filtering
            for (v = 0; v < 4; v++) {
                hor[v] = (pixel[v][0] * c_coef_4tap[frac_x][0] +
                    pixel[v][1] * c_coef_4tap[frac_x][1] +
                    pixel[v][2] * c_coef_4tap[frac_x][2] +
                    pixel[v][3] * c_coef_4tap[frac_x][3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
            }

            ver = (hor[0] * c_coef_4tap[frac_y][0] +
                hor[1] * c_coef_4tap[frac_y][1] +
                hor[2] * c_coef_4tap[frac_y][2] +
                hor[3] * c_coef_4tap[frac_y][3] + 32) >> 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

void
s_orig_omp2(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int int_buf[35 * 35];               // 1+32+2 = 35
#if 0
    int int_x, int_y;
#endif
    int frac_x, frac_y;
    int pixel[4][4];
    int hor[4];

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
    int_fill(int_buf);

    BEG;

#pragma omp parallel for private(pixel,hor) shared(int_buf,frac_x,frac_y)
    for (int y = 0; y < blk_size; y++) {
        for (int x = 0; x < blk_size; x++) {
            for (int v = 0; v < 4; v++) {
                for (int h = 0; h < 4; h++) {
                    pixel[v][h] = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                }
            }

            // 4x4 filtering
            for (int v = 0; v < 4; v++) {
                hor[v] = (pixel[v][0] * c_coef_4tap[frac_x][0] +
                    pixel[v][1] * c_coef_4tap[frac_x][1] +
                    pixel[v][2] * c_coef_4tap[frac_x][2] +
                    pixel[v][3] * c_coef_4tap[frac_x][3] + 32) >> 6;
                hor[v] = I_CLIP3(0, 255, hor[v]);
            }

            int ver = (hor[0] * c_coef_4tap[frac_y][0] +
                hor[1] * c_coef_4tap[frac_y][1] +
                hor[2] * c_coef_4tap[frac_y][2] +
                hor[3] * c_coef_4tap[frac_y][3] + 32) >> 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

void
s_fix1(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int int_buf[35 * 35];               // 1+32+2 = 35
#if 0
    int int_x, int_y;
#endif
    int frac_x, frac_y;
    int x, y, h, v;

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
    int_fill(int_buf);

    BEG;

    for (y = 0; y < blk_size; y++) {
        for (x = 0; x < blk_size; x++) {
            int ver = 0;

            for (v = 0; v < 4; v++) {
                int hor_h = 0;

                for (h = 0; h < 4; h++) {
                    int pixel = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                    hor_h += pixel * c_coef_4tap[frac_x][h];
                }

                hor_h += 32;
                hor_h >>= 6;
                hor_h = I_CLIP3(0, 255, hor_h);

                ver += hor_h * c_coef_4tap[frac_y][v];
            }

            ver += 32;
            ver >>= 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

void
s_fix1_omp(int list, int ref_pos_x, int ref_pos_y, int blk_size,
    int ref_buf[])
{
    int int_buf[35 * 35];               // 1+32+2 = 35
#if 0
    int int_x, int_y;
#endif
    int frac_x, frac_y;
#if 0
    int x, y, h, v;
#endif

#if 0
    int_x = ref_pos_x >> 2;
    int_y = ref_pos_y >> 2;
#endif
    frac_x = ref_pos_x & 3;
    frac_y = ref_pos_y & 3;

    // int_buf fill-up function is here. it's too long but is not the bottleneck
    int_fill(int_buf);

    BEG;

#pragma omp parallel for shared(int_buf,frac_x,frac_y)
    for (int y = 0; y < blk_size; y++) {
        for (int x = 0; x < blk_size; x++) {
            int ver = 0;

            for (int v = 0; v < 4; v++) {
                int hor_h = 0;

                for (int h = 0; h < 4; h++) {
                    int pixel = int_buf[(y + v) * (blk_size + 3) + (x + h)];
                    hor_h += pixel * c_coef_4tap[frac_x][h];
                }

                hor_h += 32;
                hor_h >>= 6;
                hor_h = I_CLIP3(0, 255, hor_h);

                ver += hor_h * c_coef_4tap[frac_y][v];
            }

            ver += 32;
            ver >>= 6;

            ref_buf[y * blk_size + x] = I_CLIP3(0, 255, ver);
        }
    }

    END;
}

#define FNCDEF(_sym,_reason) \
    .fnc_ptr = _sym, .fnc_sym = #_sym, .fnc_reason = _reason

struct fncdef fnclist[] = {
    { FNCDEF(s_orig,"OP's original code") },
#ifdef _OPENMP
    { FNCDEF(s_orig_omp1,"OP's -fopenmp code [BROKEN]") },
    { FNCDEF(s_orig_omp2,"OP's -fopenmp code (with fixes)") },
#endif
    { FNCDEF(s_fix1,"Craig's cleanup") },
#ifdef _OPENMP
    { FNCDEF(s_fix1_omp,"Craig's -fopenmp") },
#endif
    { .fnc_ptr = NULL }
};

#define FNCALL \
    struct fncdef *fnc = fnclist;  fnc->fnc_ptr != NULL;  ++fnc

void
dofnc(struct fncdef *fnc,
    int list, int ref_pos_x, int ref_pos_y, int blk_size)
{
    double t_dif;

    fnc->fnc_elap = 1e9;

    for (int iter = 1;  iter <= 5;  ++iter) {
        fnc->fnc_ptr(list,ref_pos_x,ref_pos_y,blk_size,fnc->fnc_ref);
        t_dif = t_end - t_beg;
        if (t_dif < fnc->fnc_elap)
            fnc->fnc_elap = t_dif;
    }

    printf("ELAPSED: %.9f %s (%s)\n",
        fnc->fnc_elap,fnc->fnc_sym,fnc->fnc_reason);
}

void
dotest(int list,int ref_pos_x,int ref_pos_y,int blk_size)
{

    for (FNCALL) {
        fnc->fnc_ref = malloc(sizeof(int) * blk_size * blk_size);
        dofnc(fnc,list,ref_pos_x,ref_pos_y,blk_size);
    }

    struct fncdef *ref = NULL;
    for (FNCALL) {
        if (ref == NULL) {
            ref = fnc;
            continue;
        }

        const int *ref_orig = ref->fnc_ref;
        const int *ref_fast = fnc->fnc_ref;

        printf("dotest: COMPARE %s\n",fnc->fnc_sym);

        for (int y = 0;  y < blk_size;  ++y) {
            for (int x = 0;  x < blk_size;  ++x) {
                int orig = ref_orig[(y * blk_size) + x];
                int fast = ref_fast[(y * blk_size) + x];
                if (fast != orig) {
                    printf("dotest: [%3d][%3d] %4d %4d\n",y,x,orig,fast);
                    //exit(1);
                }
            }
        }
    }

    for (FNCALL)
        free(fnc->fnc_ref);
}

int
main(void)
{

    //dotest(0,35,0,16);
    dotest(0,35,0,32);

    return 0;
}

In the code above, I've used cpp conditionals to denote old vs. new code:

#if 0
// old code
#else
// new code
#endif

#if 1
// new code
#endif

Note: this can be cleaned up by running the file through unifdef -k

Here are the commands I used:

cc -o sO0 bench.c -O0 -Wall -Werror -Wno-unknown-pragmas
./sO0
ELAPSED: 0.000141978 s_orig (OP's original code)
ELAPSED: 0.000132095 s_fix1 (Craig's cleanup)
dotest: COMPARE s_fix1

cc -o sO2 bench.c -O2 -Wall -Werror -Wno-unknown-pragmas
./sO2
ELAPSED: 0.000027120 s_orig (OP's original code)
ELAPSED: 0.000023231 s_fix1 (Craig's cleanup)
dotest: COMPARE s_fix1

cc -o std bench.c -O3 -Wall -Werror -Wno-unknown-pragmas
./std
ELAPSED: 0.000010765 s_orig (OP's original code)
ELAPSED: 0.000010280 s_fix1 (Craig's cleanup)
dotest: COMPARE s_fix1

cc -o omp bench.c -O3 -Wall -Werror -fopenmp
./omp
ELAPSED: 0.000010361 s_orig (OP's original code)
ELAPSED: 0.000022512 s_orig_omp1 (OP's -fopenmp code [BROKEN])
ELAPSED: 0.000004398 s_orig_omp2 (OP's -fopenmp code (with fixes))
ELAPSED: 0.000015385 s_fix1 (Craig's cleanup)
ELAPSED: 0.000004432 s_fix1_omp (Craig's -fopenmp)
dotest: COMPARE s_orig_omp1
dotest: [  0][ 14]   51  255
dotest: [  3][ 17]  159  255
dotest: [  3][ 23]  165  255
dotest: [  5][  0]  212  255
dotest: [  8][ 18]  255  190
dotest: [ 12][  0]  255  181
dotest: [ 16][  0]  255   73
dotest: [ 16][ 13]  255  241
dotest: [ 20][  8]  255   38
dotest: [ 22][  0]  255  100
dotest: [ 24][ 12]  255   44
dotest: [ 25][ 21]  255  241
dotest: [ 25][ 22]  255  247
dotest: [ 28][ 21]  255   66
dotest: [ 28][ 22]  255  219
dotest: [ 28][ 23]  255  221
dotest: [ 28][ 25]  255  228
dotest: [ 29][  0]  255   89
dotest: [ 30][  0]  255  125
dotest: [ 31][  1]  255  169
dotest: COMPARE s_orig_omp2
dotest: COMPARE s_fix1
dotest: COMPARE s_fix1_omp

UPDATE:

A curiosity ...

I noticed a discrepancy in the timings of s_fix1 with -O3:

Without -fopenmp, it is 0.000010339
With -fopenmp, it is: 0.000015357

Since s_fix1 has no #pragma omp directives, I would expect them to be the same. But the -fopenmp version is 30% slower!

I tried with clang and got a similar result.

My compilers (on x86_64):

gcc 8.3.1
clang 7.0.1

Then, it hit me ...

With -fopenmp, after doing the first test, it does two omp tests and then does s_fix1!

When I redid the test, changing the order of the tests in the table, doing s_orig and then s_fix1 immediately, I got the expected results.

Even though s_fix1 is non-omp, it was affected by running tests that did use omp.

I'm a comparative newbie to openmp, so, apparently, I don't know how to shut down the omp threads once they're started.

The omp threads lingered and interfered with the subsequent non-omp test.

As further confirmation, I created a version that does fork and runs the actual test in a subprocess. Then, I get the expected timing for s_fix1

Some sites say that you can't shutdown omp threads:

Force closing a thread using openMP in C++?

Fully cleaning up omp threads is useful, particularly if you're trying to benchmark omp and non-omp versions of code in the same program.

But, apparently, with openmp 4.0, it's possible with #pragma omp cancel et. al.

Upvotes: 2

How can I optimize this &quot;dot product&quot; function using SIMD? It&#39;s Mat4x4 * Vec4 but with huge strided access

Answers (2)

Related Questions

How can I optimize this "dot product" function using SIMD? It's Mat4x4 * Vec4 but with huge strided access