PhilBot

Reputation: 60

ARM NEON Optimization no faster than C++ Pointer Implementation

I have 2 functions for splitting a YUYV frame into Y/U/V independent planes. I am doing this in order to perform format conversion from a YUYV video frame to RGBA in an OpenGL ES 2.0 Shader by uploading 3 textures containing the Y/U/V data to the GPU. One of these functions is written in C++ and one is written in ARM NEON. My target is the Cortex-A15 AM57xx Sitara.

I expected the NEON code to outperform the C++ code, but they perform the same. One possibility is that I am memory I/O bound. Another possibility is that I am not great at writing NEON code.

Why do these 2 functions perform the same? Are there any glaring optimizations that could be made to either function?

Neon Function:

/// This structure is passed to ARM Assembly code
/// to split the YUV frame into separate planes for
/// OpenGL Consumption
typedef struct {
    char *input_data;
    int input_size;
    char *y_plane;
    char *u_plane;
    char *v_plane;
} yuvSplitStruct;

void TopOpenGL::splitYuvPlanes(yuvSplitStruct *yuvStruct)
{

    __asm__ volatile(

                "PUSH {r4}\n"                            /* Save callee-save registers R4 and R5 on the stack */
                "PUSH {r5}\n"                            /* r1 is the pointer to the input structure ( r0 is 'this' because c++ ) */
                "ldr r0 , [r1]\n"                        /* reuse r0 scratch register for the address of our frame input */
                "ldr r2 , [r1, #4]\n"                    /* use r2 scratch register to store the size in bytes of the YUYV frame */
                "ldr r3 , [r1, #8]\n"                    /* use r3 scratch register to store the destination Y plane address */
                "ldr r4 , [r1, #12]\n"                   /* use r4 register to store the destination U plane address */
                "ldr r5 , [r1, #16]\n"                   /* use r5 register to store the destination V plane address */
                "/* pld [r0, #192] PLD Does not seem to help */"
                    "mov r2, r2, lsr #5\n"               /* Divide number of bytes by 32 because we process 16 pixels at a time */
                    "loopYUYV:\n"
                        "vld4.8 {d0-d3}, [r0]!\n"        /* Load 8 YUYV elements from our frame into d0-d3, increment frame pointer */
                        "vst2.8 {d0,d2}, [r3]!\n"        /* Store both Y elements into destination y plane, increment plane pointer */
                        "vmov d0, d1\n"                  /* Copy the U values into d0 so each U is stored twice */
                        "vst2.8 {d0,d1}, [r4]!\n"        /* Store the duplicated U elements into the u plane, increment plane pointer */
                        "vmov d1, d3\n"                  /* Copy the V values into d1 so each V is stored twice */
                        "vst2.8 {d1,d3}, [r5]!\n"        /* Store the duplicated V elements into the v plane, increment plane pointer */
                        "subs r2, r2, #1\n"              /* Decrement the loop counter */
                    "bgt loopYUYV\n"                     /* Loop until entire frame is processed */
                "POP {r5}\n"                             /* Restore callee-save registers */
                "POP {r4}\n"
                : /* no outputs */
                : /* no inputs: relies on the argument still being in r1 (AAPCS) */
                : "r0", "r2", "r3", "d0", "d1", "d2", "d3", "cc", "memory"
    );

}

C++ Function:

void TopOpenGL::splitYuvPlanes(unsigned char *data, int size, unsigned char *y, unsigned char *u, unsigned char *v)
{

    for ( int c = 0 ; c < size ; c += 4 ) { /* size is assumed to be a multiple of 4; stopping at (size - 4) would skip the last YUYV group */

        *y = *data; // Y0
        data++;
        *u = *data; // U0
        u++;
        *u = *data; // U0
        data++;
        y++;
        *y = *data; // Y1
        data++;
        *v = *data; // V0
        v++;
        *v = *data; // V0

        data++;
        y++;
        u++;
        v++;
    }

}
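If the goal is mainly to avoid the inline-assembly boilerplate, the same split can be sketched with NEON intrinsics and left to the compiler to schedule (a sketch; `splitYuvIntrinsics` is a made-up name, and the scalar tail doubles as a fallback on non-NEON builds):

```cpp
#include <cassert>
#include <vector>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Sketch: split a YUYV buffer into Y/U/V planes. As in the original code,
// each U and V sample is stored twice so all three planes have equal width.
void splitYuvIntrinsics(const unsigned char *data, int size,
                        unsigned char *y, unsigned char *u, unsigned char *v)
{
    int c = 0;
#if defined(__ARM_NEON)
    for (; c + 32 <= size; c += 32) {              // 8 YUYV groups per iteration
        uint8x8x4_t px = vld4_u8(data + c);        // de-interleave into Y0,U,Y1,V
        uint8x8x2_t ys = {{ px.val[0], px.val[2] }};
        vst2_u8(y, ys);  y += 16;                  // re-interleave Y0,Y1
        uint8x8x2_t us = {{ px.val[1], px.val[1] }};
        vst2_u8(u, us);  u += 16;                  // each U written twice
        uint8x8x2_t vs = {{ px.val[3], px.val[3] }};
        vst2_u8(v, vs);  v += 16;                  // each V written twice
    }
#endif
    for (; c + 4 <= size; c += 4) {                // scalar tail / non-NEON builds
        *y++ = data[c];                            // Y0
        *u++ = data[c + 1]; *u++ = data[c + 1];    // U duplicated
        *y++ = data[c + 2];                        // Y1
        *v++ = data[c + 3]; *v++ = data[c + 3];    // V duplicated
    }
}
```

With intrinsics the compiler schedules the loads and stores itself, which for a pure data-movement loop like this usually matches hand-written assembly.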

Upvotes: 1

Views: 893

Answers (1)

Peter M

Reputation: 1998

This question involves two different factors: 1. Are NEON instructions faster than "regular" ARM instructions? 2. Can I write better assembly than the compiler?

Are Neon Instructions faster than "regular" ARM instructions?

Your algorithm only involves loading data and storing it elsewhere. On the A15, the load/store pipelines are shared between the NEON and ARM registers. This may not be the full picture, but any benefit that might have existed in the past is largely gone: the A8 and A9 had separate load/store pipelines for NEON, along with different instruction-issue logic, instruction reordering, and branch-prediction capabilities. So on the A15 those considerations are no longer a major factor when weighing NEON instructions against regular ARM instructions. Even back then, memcpy was often faster in ARM instructions than in NEON.
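One quick way to test the memory-bound hypothesis (a sketch; `memcpyNsPerByte` is a made-up helper): time a plain memcpy over a frame-sized buffer and compare it with the split function's time per byte. If the two are close, the loop is limited by memory traffic rather than by instruction choice.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>
#include <vector>

// Made-up helper: average nanoseconds per byte to memcpy a `size`-byte
// buffer, repeated `iters` times. Compare this with the per-byte cost
// of the plane-splitting function on the same buffer size.
double memcpyNsPerByte(int size, int iters)
{
    std::vector<unsigned char> src(size, 0x5a), dst(size);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        std::memcpy(dst.data(), src.data(), size);    // pure load/store traffic
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (static_cast<double>(size) * iters);  // ns per byte moved
}
```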

A nice intro on the now quite old A8 is http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html.

A view of the superscalar architecture on the A15: http://www.extremetech.com/wp-content/uploads/2012/11/Cortex-A15Block.jpg

Which you can compare to the A8:

http://courses.cecs.anu.edu.au/courses/ENGN8537/notes/images/processor/arm-a8-pipeline.png

Note that on the A8, the NEON is a very separated block, but on the A15 a lot of stuff is shared.

Can I write better assembly than the compiler?

Maybe, but on modern architectures this requires an increasingly deep understanding of the micro-architecture, especially for operations that just permute or interleave data. If you are writing more complex data processing that actually involves multiplications, then yes, you can often beat the compiler, in particular by tuning your loop unrolling to the write-back delay of the multiply. Unrolling a loop is something that takes effort to coax out of a compiler, since it often restricts the length of your data (must it be a multiple of 4, for example?). With loads and stores there are fewer interesting optimizations to make, since there is no write-back delay from a math operation to hide.
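For instance, telling the compiler the planes cannot alias and giving it a simple counted loop is often enough for GCC/Clang to auto-vectorize the scalar split on its own (a sketch; `splitYuvRestrict` is a made-up name, and `__restrict` is the GCC/Clang spelling):

```cpp
#include <cassert>
#include <vector>

// Made-up variant of the scalar split: __restrict promises the planes do
// not alias the input, and indexing by YUYV group gives the compiler a
// simple, countable loop it can unroll and vectorize on its own.
// Assumes size is a multiple of 4 (one Y0,U,Y1,V group).
void splitYuvRestrict(const unsigned char *__restrict data, int size,
                      unsigned char *__restrict y,
                      unsigned char *__restrict u,
                      unsigned char *__restrict v)
{
    const int n = size / 4;               // number of Y0,U,Y1,V groups
    for (int i = 0; i < n; ++i) {
        y[2 * i]     = data[4 * i];       // Y0
        u[2 * i]     = data[4 * i + 1];   // U
        u[2 * i + 1] = data[4 * i + 1];   // U duplicated
        y[2 * i + 1] = data[4 * i + 2];   // Y1
        v[2 * i]     = data[4 * i + 3];   // V
        v[2 * i + 1] = data[4 * i + 3];   // V duplicated
    }
}
```

Compiled with `-O3` (and `-mfpu=neon` on 32-bit ARM), a loop in this shape typically becomes vld4/vst2 sequences much like the hand-written assembly.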

There is lots of material on pipelined processor architecture; a good starting point:

https://en.wikipedia.org/wiki/Classic_RISC_pipeline#Writeback

Upvotes: 4
