Felix Yah Batta Man
Felix Yah Batta Man

Reputation: 459

Optimzed FAST calculation for ARM

I want to implement a paper I found about 5ms ORB feature calculation on an ARM cortex a8 by using the arm neon library. But I'm struggling already with the FAST feature detection. So the paper I try to implement you can find here. So first of all I'm not sure with the Bright and Dark constraint. So in my understanding you have to check for FAST if there are 9 darker or 9 brighter pixel around the center pixel. So I check for both. But now I have the problem that my implementation takes in average already 3 times longer without the final shift operations to compute if it is a corner then the average computation from opencv for the whole progress. So here is my code so far maybe someone could point me to some optimizatios I could do to it.

        //detect with opncv
        Clock::time_point t0 = Clock::now();
        detectors[y]->detect(img, ocv_kps);
        Clock::time_point t1 = Clock::now();

        vector<Point2f> my_kps;
        //threshhold for FAST
        const uchar th = 8;

        int b_cnt = 0;
        int d_cnt = 0;
        //array with four possible corners to be processed in parallel
        uint32_t id_arr[4];
        uint32_t ib_arr[4];

        Clock::time_point t01 = Clock::now();
        for (int i = 3; i < img.rows - 3; i++) {
            //get pointer to seven Image rows three above and three below center and center itself
            const uchar* Mt3 = img.ptr<uchar>(i - 3);
            const uchar* Mt2 = img.ptr<uchar>(i - 2);
            const uchar* Mt1 = img.ptr<uchar>(i - 1);
            const uchar* Mc = img.ptr<uchar>(i);
            const uchar* Mb1 = img.ptr<uchar>(i + 1);
            const uchar* Mb2 = img.ptr<uchar>(i + 2);
            const uchar* Mb3 = img.ptr<uchar>(i + 3);
            for (int j = 3; j < img.cols - 3; j++) {
                const uchar j3 = j + 3;
                const uchar j2 = j + 2;
                const uchar j1 = j + 1;
                const uchar jn3 = j - 3;
                const uchar jn2 = j - 2;
                const uchar jn1 = j - 1;

                 //image values for center left right top and bottom intensity of pixel
                const uchar c = Mc[j];
                const uchar l = Mc[jn3];
                const uchar r = Mc[j3];
                const uchar t = Mt3[j];
                const uchar b = Mb3[j];

                //threshold for bright FAST constraint
                const uchar thb = c + th;

                //bools for bright constraint
                const bool cbt = t > thb;
                const bool cbb = b > thb;
                const bool cbl = l > thb;
                const bool cbr = r > thb;

                 uchar mt3;
                 uchar mt3n;
                 uchar mt2;
                 uchar mt2n;
                 uchar mt1;
                 uchar mt1n;
                 uchar mb3;
                 uchar mb3n;
                 uchar mb2;
                 uchar mb2n;
                 uchar mb1;
                 uchar mb1n;
                bool bc = false;
                //pre test do we have at least two points which fulfill bright constraint
                if ((cbl && cbt) || (cbt && cbr) || (cbr && cbb)
                        || (cbb && cbl)) {
                    bc = true;
                    //get rest of image intensity values of circle
                    mt3 = Mt3[j1];
                    mt3n = Mt3[jn1];
                    mt2 = Mt2[j2];
                    mt2n = Mt2[jn2];
                    mt1 = Mt1[j3];
                    mt1n = Mt1[jn3];
                    mb3 = Mb3[j1];
                    mb3n = Mb3[jn1];
                    mb2 = Mb2[j2];
                    mb2n = Mb2[jn2];
                    mb1 = Mb1[j3];
                    mb1n = Mb1[jn3];

                    //values for bright constrain
                    ib_arr[b_cnt] = cbt | ((mt3) > thb) << 1
                            | ((mt2) > thb) << 2 | ((mt1) > thb) << 3
                            | (cbr << 4) | ((mb1) > thb) << 5
                            | ((mb2) > thb) << 6 | ((mb3) > thb) << 7
                            | cbb << 8 | ((mb3n) > thb) << 9
                            | ((mb2n) > thb) << 10 | ((mb1n) > thb) << 11
                            | (cbl) << 12 | ((mt1n) > thb) << 13
                            | ((mt2n) > thb) << 14 | ((mt3n) > thb) << 15
                            | (cbt) << 16 | ((mt3) > thb) << 17
                            | ((mt2) > thb) << 18 | ((mt1) > thb) << 19
                            | (cbr) << 20 | ((mb1) > thb) << 21
                            | ((mb2) > thb) << 22 | ((mb3) > thb) << 23;
                    b_cnt++;
                    //if we have four possible corners in array check if they are corners
                    if (b_cnt == 4) {
                        uint32x2x4_t IB = vld4_u32(ib_arr);
                        /*
                         * here the actual shift operation would take place
                         */
                        b_cnt = 0;
                    }
                }

                //threshold for dark constraint
                const uchar thd = c - th;
                //bools for dark constraint
                const bool cdl = l < thd;
                const bool cdr = r < thd;
                const bool cdt = t < thd;
                const bool cdb = b < thd;
                //pre test do we have at least two points which fulfill dark constraint
                if ((cdl && cdt) || (cdt && cdr) || (cdr && cdb)
                        || (cdb && cdl)) {
                    //if bright pre test failed intensity values are not initialised
                    if (!bc) {
                        //get rest of image intensity values of circle
                        mt3 = Mt3[j1];
                        mt3n = Mt3[jn1];
                        mt2 = Mt2[j2];
                        mt2n = Mt2[jn2];
                        mt1 = Mt1[j3];
                        mt1n = Mt1[jn3];
                        mb3 = Mb3[j1];
                        mb3n = Mb3[jn1];
                        mb2 = Mb2[j2];
                        mb2n = Mb2[jn2];
                        mb1 = Mb1[j3];
                        mb1n = Mb1[jn3];
                    }
                    //bool values for dark constrain
                    id_arr[d_cnt] = cdt | ((mt3) < thd) << 1
                            | ((mt2) < thd) << 2 | ((mt1) < thd) << 3
                            | (cdr) << 4 | ((mb1) < thd) << 5
                            | ((mb2) < thd) << 6 | ((mb3) < thd) << 7
                            | (cdb) << 8 | ((mb3n) < thd) << 9
                            | ((mb2n) < thd) << 10 | ((mb1n) < thd) << 11
                            | (cdl) << 12 | ((mt1n) < thd) << 13
                            | ((mt2n) < thd) << 14 | ((mt3n) < thd) << 15
                            | (cdt) << 16 | ((mt3) < thd) << 17
                            | ((mt2) < thd) << 18 | ((mt1) < thd) << 19
                            | (cdr) << 20 | ((mb1) < thd) << 21
                            | ((mb2) < thd) << 22 | ((mb3) < thd) << 23;
                    d_cnt++;
                    //if we have four possible corners in array check if they are corners
                    if (d_cnt == 4) {
                        uint32x2x4_t IA = vld4_u32(id_arr);
                        /*
                         * here the actual shift operation would take place
                         */
                        d_cnt = 0;
                    }
                    int h = cdt;

                }
            }
        }
        Clock::time_point t11 = Clock::now();
        cout << "my algorithm found " << my_kps.size()
                << " and ocv found " << ocv_kps.size() <<  endl;

        microseconds ms1 = std::chrono::duration_cast < microseconds
                > (t1 - t0);
        microseconds ms2 = std::chrono::duration_cast < microseconds
                > (t11 - t01);

        rs.Push((double) ms2.count());
        cout << "my algorithm duration " << ms2.count()
                << " and ocv duration is " << ms1.count()  << endl;

Upvotes: 2

Views: 1194

Answers (2)

user364952
user364952

Reputation: 620

I have an ORB extractor that runs at 30fps on the raspberry pi.

https://github.com/0xfaded/pislam

Optimisation is really a black art, and to make matters worse ARM never released an optimisation guide for the a53. The best we have is for the a57, which likely has a similar NEON unit.

I can't really offer a full answer here, but I'll share a little about my process.

The first part of my FAST extractor loads the ring of test pixels and converts them into a 16 bit vector, just like your code does. I didn't write the asm directly but instead used gcc intrinsics. Still though, I made sure that gcc:

  1. didn't spill any registers to the stack
  2. emitted the minimal number of instructions for each comparison

You will notice that the first comparison does not isolate its bit with a mask, which would have been 0x80. This freed up a register which would otherwise be holding a constant, and it gave gcc just enough wiggle room to not spill registers.

You will also notice some fairly gruesome intrinsic usage:

  d0 = vbslq_u8(vdupq_n_u8(0x40u), vcgeq_u8(test, dark), d0);
  l0 = vbslq_u8(vdupq_n_u8(0x40u), vcleq_u8(test, light), l0);

This is equivalent to

  d0 |= test >= dark & 0x40;
  l0 |= test >= light & 0x40;

Gcc will happily compile the latter, but emit 1.5x as many instructions.

The second part was coming up with a FAST-9 test on the 16 bit vector. The below compiles to 16 instructions, but took me almost a month of on and off thinking to come up with.

  uint8x16_t t0 = vtstq_u8(d0, d1);
  uint8x16_t t1 = vtstq_u8(d0, d1);

  t0 = vbslq_u8(t0, l0, d0);
  t1 = vbslq_u8(t1, l1, d1);

  uint8x16_t cntLo = vclzq_u8(t0);
  uint8x16_t testLo = t1 << (cntLo - 1);
  asm("vceq.u8  %q0, %q0, #0" : [val] "+w" (testLo));

  uint8x16_t cntHi = vclzq_u8(t1);
  uint8x16_t testHi = t0 << (cntHi - 1);
  asm("vceq.u8  %q0, %q0, #0" : [val] "+w" (testHi));

  uint8x16_t result = (cntLo & testLo) | (cntHi & testHi);
  result = vtstq_u8(result, result);

Annoyingly, gcc wouldn't compile testLo == 0 as vceq.u8 %q0, %q0, #0, which is a special instruction for comparing to the constant zero. I wound up inserting these manually which shaved off another couple instructions.

Hope that provides some insight. Fast.h

Upvotes: 1

Felix Yah Batta Man
Felix Yah Batta Man

Reputation: 459

So after digging a bit in Arm Assembler. I came up with a code that runs at least 2 times faster on Arm then the built in OpenCv implementation of Fast9. You can check the code on GitHub. I'm very pleased about any recomandations for optimizing it. On my Raspberry Pi 3 it takes round: 1000ms for my algorithm 2000ms for OpenCv

on an 320x240 grayscale image.

Upvotes: 0

Related Questions