Reputation: 180
I am trying to get familiar with NEON instructions, both assembly and intrinsics. I use gcc 4.8.2 (hardfp). I would like to use the NEON memcpy with preload according to:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html
I have also found the topic "ARM memcpy and alignment", but it is slightly different from the implementation on the official ARM page.
Unfortunately I have never used .s with .c files at the same time so I need some help. My .c file looks like this:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <stdint.h>
#include <arm_neon.h>
int main()
{
clock_t start, end; // timer variables
uint32_t i,X=100;
size_t size = 2048*32/* arbitrary */;
size_t offset = 1;
char* src = malloc(sizeof(char)*(size + offset));
char* dst = malloc(sizeof(char)*(size));
NEONCopyPLD( dst, src + offset, size );
memcpy( dst, src + offset, size );
return(0);
}
and the assembly.s file is the following:
.global NEONCopyPLD
NEONCopyPLD:
PLD [r1, #0xC0]
VLDM r1!,{d0-d7}
VSTM r0!,{d0-d7}
SUBS r2,r2,#0x40
BGE NEONCopyPLD
I compile the program with the following command:
arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -Ofast -fprefetch-loop-arrays assembly.s asm_pr.c -o output
and I get the following error:
potentially unexpected fatal signal 11.
CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
task: bf907c00 ti: bef4a000 task.ti: bef4a000
PC is at 0x4c90ce LR is at 0x852d
pc : [<004c90ce>] lr : [<0000852d>] psr: 40030030
sp : 7e958cb0 ip : 00000107 fp : 00000000
r10: 76f91000 r9 : 00000000 r8 : 00000000
r7 : 00001017 r6 : 00e85010 r5 : 00e75009 r4 : 00010001
r3 : 000f4240 r2 : 00010000 r1 : 00e75009 r0 : 00e85010
Flags: nZcv IRQs on FIQs on Mode USER_32 ISA Thumb Segment user
Control: 10c5387d Table: 4ef7404a DAC: 00000015
CPU: 0 PID: 670 Comm: out_asm Not tainted 3.10.9-rt5+ #2
Backtrace:
[<800120a4>] (dump_backtrace+0x0/0x118) from [<80012318>] (show_stack+0x20/0x24)
[<800122f8>] (show_stack+0x0/0x24) from [<804fab0c>] (dump_stack+0x24/0x28)
[<804faae8>] (dump_stack+0x0/0x28) from [<8000f560>] (show_regs+0x30/0x34)
[<8000f530>] (show_regs+0x0/0x34) from [<8003349c>](get_signal_to_deliver+0x318/0x668)
[<80033184>] (get_signal_to_deliver+0x0/0x668) from [<80011664>] (do_signal+0x11c/0x450)
[<80011548>] (do_signal+0x0/0x450) from [<80011b20>] (do_work_pending+0x74/0xac)
[<80011aac>] (do_work_pending+0x0/0xac) from [<8000e500>] (work_pending+0xc/0x20)
Segmentation fault
Another question I have is whether we can use SIMD instructions (intrinsics or autovectorization) to speed up the initialization of an array with 0. I have noticed that the following code cannot be autovectorized:
for (i=0;i<N;i++)
*(a++)=0;
however this block of code can be autovectorized:
for (i=0;i<N;i++)
a[i]=i;
My ultimate goal is to investigate whether I can have a NEON function that runs faster than memset().
Finally, I would like to ask something about unvectorizable loops. According to http://gcc.gnu.org/projects/tree-ssa/vectorization.html#unvectoriz the following code cannot be autovectorized:
while (*p != NULL) {
*q++ = *p++;
}
However, is it possible to use intrinsics or assembly to develop a faster version of this loop? If you have done something similar, could you please post it here?
Upvotes: 0
Views: 4300
Reputation: 18217
Unrelated to your question as such, but your code sample as shown cannot work correctly. That's because you seem to have alignment traps active, and are hitting one:
[ ... ]
size_t offset = 1;
char* src = malloc(sizeof(char)*(size + offset));
[ ... ]
NEONCopyPLD( dst, src + offset, size );
r7 : 00001017 r6 : 00e85010 r5 : 00e75009 r4 : 00010001
r3 : 000f4240 r2 : 00010000 r1 : 00e75009 r0 : 00e85010
                                 ^^^^^^^^
You're using a misaligned pointer with VLDM (src is never aligned due to offset == 1).
From the reg dump, since your Neon asm function doesn't use R5 on its own, the fact you're seeing R1 == R5 makes me conclude you're running with alignment traps enabled, and get the SIGSEGV the very first time you hit that VLDM.
That's because you're not using R5 in your assembly, so the value being there has been used before by the C function; therefore R1 and R5 not being different means R1 hasn't changed before the trap was taken, and that means the VLDM R1!,... cannot have executed even once.
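The misalignment can also be checked from C before the routine is called. What follows is only a minimal sketch: is_aligned is a hypothetical helper that is not in the original code, and the 4-byte (word) boundary is an assumption about what VLDM needs here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post).
 * Returns 1 if p is aligned to `align` bytes, 0 otherwise.
 * `align` must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

With the values from the register dump, is_aligned((const void *)(uintptr_t)0x00e75009u, 4) is 0, because R1 holds an odd address. A check like this before the call lets the caller fall back to memcpy for misaligned buffers instead of trapping.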
Upvotes: 2
Reputation: 6354
You may google for "aosp bionic memcpy".
It's not perfect, but it is quite a decent implementation.
I suggest you start with memset instead, though, since memcpy is much more complicated than you might think.
Analyze bionic memset, try to understand the flow, and ask if you don't understand why the author did something in that particular way.
And I also don't understand why you are talking about auto-vectorization which is utterly useless IMO.
Please, do some study on your own first, and ask if you get stuck.
To answer this particular question, it would take a whole tutorial consisting of multiple chapters, starting with basic ARM instructions.
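As a starting point for experimenting with memset, here is a minimal sketch of a zero fill written with NEON intrinsics. The name zero_fill is hypothetical, and this is not the bionic implementation. The code falls back to a plain byte loop when NEON is not available, and whether it beats the library memset on a given core has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: set n bytes starting at dst to zero. */
static void zero_fill(uint8_t *dst, size_t n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One 128-bit register full of zeros, stored 16 bytes at a time. */
    const uint8x16_t zero = vdupq_n_u8(0);
    while (n >= 16) {
        vst1q_u8(dst, zero);
        dst += 16;
        n -= 16;
    }
#endif
    /* Remaining tail bytes on NEON targets; the whole buffer elsewhere. */
    while (n > 0) {
        *dst++ = 0;
        n--;
    }
}
```

A tuned memset additionally aligns the destination first and uses wider stores for large blocks, which is why reading the bionic version is worthwhile.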
Upvotes: 1
Reputation: 86313
You never return from your assembler functions. Therefore whatever code is stored below the assembler function will get executed. This will lead to a crash sooner or later.
Exit your functions with:
mov pc, lr
This will very likely fix your problems. You should also check which registers (neon and general purpose registers) you have to preserve during assembler function calls.
This page is a useful resource that shows examples of how to do this: http://omappedia.org/wiki/Writing_ARM_Assembly
Upvotes: 1