Reputation: 13
I have a problem with inline assembly on AArch64 Linux; the gcc version is 7.3.0.
uint8x16_t vcopyq_laneq_u8_inner(uint8x16_t a, const int b, uint8x16_t c, const int d)
{
    uint8x16_t res;
    __asm__ __volatile__(
        "ins %[dst].B[%[dlane]], %[src].B[%[slane]] \n\t"
        :[dst] "=w"(res)
        :"0"(a), [dlane]"i"(b), [src]"w"(c), [slane]"i"(d)
        :);
    return res;
}
This function used to be an inline function that could be compiled and linked into an executable program. But now we want to compile this function into a dynamic library, so we removed its inline keyword. Now it no longer compiles, and the error output is:
warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'
I guess this error happens because the inline-assembly constraint "i" needs an "immediate integer operand", but the variables 'b' and 'd' are constant variables, aren't they?
I have one idea to make this function compile successfully: use if-else to test the values of 'b' and 'd', and replace dlane/slane with an "immediate integer operand" in each branch. But in our code, uint8x16_t means a structure of 16 uint8_t values, so I would need to write 16x16 == 256 if-else branches, which is inefficient.
So my question is the following: is there an efficient way to avoid using 256 if-else statements?
Upvotes: 0
Views: 1581
Reputation: 4003
But now we want to compile this function into a dynamic library, so we removed its inline keyword. But it cannot compile successfully, and error info is:
warning: asm operand 2 probably doesn't match constraints
warning: asm operand 4 probably doesn't match constraints
error: impossible constraint in 'asm'
I guess this error happend because of the inline-assembly code "i" need a "immediate integer operand"
In GCC, constraint "i" means "immediate operand", which is a value that is known at link-time or earlier, and that is an integer or an address. For example, the address of a variable in static storage is known at link time, and you can use it just like a known value (provided the assembler supports a RELOC for it, which is beyond GCC).
but the var 'b' and 'd' is constant-var, isn't it?
const in C basically means read-only, which does not imply the value is known at link-time or earlier.
If that function was inline, and the context (hosting function and compiler optimization) is such that the values turn out to be known, then the constraints can be satisfied.
If the context is such that "i" cannot be satisfied, which is the case for a library function where you don't know the context at compile-time, then gcc will throw an error.
One way is to supply the function as static inline in the header that accompanies the library (*.so, *.a, etc.) and describes the library interfaces and public functions. In that case the user is responsible for only using the function in appropriate contexts (or getting that error message thrown at them).
A second way is to rewrite the inline assembly to use instructions that can handle operands that are only known at run-time, e.g. register operands. This is usually less efficient and increases register pressure. In the case of a library function, you will also add call overhead just to issue one instruction.
A third way is to combine both approaches and supply the function as static inline in the library header, but write it like
static inline __attribute__((__always_inline__))
uint8x16_t vcopyq_laneq_u8_inner (uint8x16_t a, int b, uint8x16_t c, int d)
{
    uint8x16_t res;
    if (__builtin_constant_p (b) && __builtin_constant_p (d))
    {
        // Both lane numbers are known at compile time: use immediates.
        __asm__ __volatile__(
            "ins %[dst].B[%[dlane]], %[src].B[%[slane]]"
            : [dst] "=w" (res)
            : "0" (a), [dlane] "i" (b), [src] "w" (c), [slane] "i" (d));
    }
    else
    {
        // Placeholder: use code and constraints that can handle
        // non-"i" b and d, and assign the result to res.
    }
    return res;
}
This allows the compiler to use the optimal code when b and d satisfy "i", but it makes the function so generic that it will also work in a broader context.
Apart from that, nothing about that instruction seems to warrant volatile. If, for example, the return value is unused, the instruction is not needed, right? In that case, remove the volatile, which gives the compiler more freedom to schedule the inline asm.
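As a rough sketch of what the else branch could contain, one option is to skip inline asm there and let the compiler choose the instructions, using GNU C vector subscripting with run-time indices. This assumes a GCC or Clang toolchain and a vector type that supports subscripting; the type u8x16 and the function name copy_lane_runtime are hypothetical stand-ins rather than anything from the question, and masking the indices to 0..15 is an assumption about the desired behaviour.

```c
#include <stdint.h>

/* Hypothetical stand-in for uint8x16_t: a 16-byte GNU C vector type. */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Copy byte lane d of c into byte lane b of a.
   b and d may be run-time values; they are masked into the range 0..15
   (an assumption, since out-of-range subscripts would be undefined). */
static inline u8x16 copy_lane_runtime(u8x16 a, int b, u8x16 c, int d)
{
    a[b & 15] = c[d & 15];  /* the compiler selects the instructions */
    return a;
}
```

The generated code is likely to be slower than a single ins with immediate lanes, so this is only meant for the case where b and d are not compile-time constants.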
Upvotes: 1
Reputation: 58929
Constraint "i" means a number. A specific number. It means you want the compiler to emit an instruction like this:
ins v0.B[2], v1.B[3]
(pardon me if my AArch64 assembly syntax isn't quite right) where v0 is the register containing res, v1 is the register containing c, 2 is the value of b (not the number of the register containing b) and 3 is the value of d (not the number of the register containing d).
That is, if you call
vcopyq_laneq_u8_inner(something, 2, something, 3)
the instruction in the function is
ins v0.B[2], v1.B[3]
but if you call
vcopyq_laneq_u8_inner(something, 1, something, 2)
the instruction in the function is
ins v0.B[1], v1.B[2]
The compiler has to know which numbers b and d are, so it knows which instruction you want. If the function is inlined, and the parameters b and d are constant numbers, it's smart enough to do that. However, if you write this function in a way where it's not inlined, the compiler has to make an actual function that works no matter what number the b and d parameters are, and how can it possibly do that if you want it to use a different instruction depending on what they are?
The only way it could do that is to write all 256 possible instructions and switch between them depending on the parameters. However, the compiler won't do that automatically - you'd need to do it yourself. For one thing, the compiler doesn't know that b and d can only go from 0 up to 15.
You should consider either not making this a library function (it's one instruction - doesn't a call into a library add overhead?) or else using different instructions where the lane number can come from a register. The instruction ins copies one vector element to another. I'm not familiar with ARM vector instructions, but there should be some instructions to rearrange or select items in a vector according to a number stored in a register.
Upvotes: 1
Reputation: 365577
const means you can't modify the variable, not that it's a compile-time constant. That's only the case if the caller passes a constant, and you compile with optimization enabled so constant-propagation can get that value to the asm statement. Even C++ constexpr doesn't require a constant expression in most contexts; it only allows it, and guarantees that compile-time constant-propagation is possible.
A stand-alone version of this function can't exist, but you didn't make it static, so the compiler has to create a non-inline definition that can get called from other compilation units, even if it inlines into every call-site in this file. But this is impossible, because const int b doesn't have a known value.
For example,
int foo(const int x){
    return x*37;
}
int bar(){
    return foo(2);
}
On Godbolt, compiled for AArch64: notice that foo can't just return a constant; it needs to work with a run-time variable argument, whatever value it happens to be. Only in bar, with optimization enabled, can it inline and not need the value of x in a register, and just return a constant (which it used as an immediate for mov).
foo(int):
        mov     w1, 37
        mul     w0, w0, w1
        ret
bar():
        mov     w0, 74
        ret
In a shared library, your function also has to be __attribute__((visibility("hidden"))) so it can actually inline; otherwise the possibility of symbol interposition means that the compiler can't assume that foo(123) is actually going to call the int foo(int) defined in the same .c file. (Or make it static inline.)
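A minimal sketch of the hidden-visibility variant, reusing the foo and bar names from the example above and assuming a GCC or Clang toolchain building with -fPIC for a shared library:

```c
/* Hidden visibility: foo is not exported from the shared object, so
   calls from inside the library cannot be interposed and the compiler
   is free to inline them. */
__attribute__((visibility("hidden")))
int foo(const int x)
{
    return x * 37;
}

/* Exported entry point. With optimization enabled the call can be
   inlined and constant-folded, as in the asm output shown earlier. */
int bar(void)
{
    return foo(2);
}
```

The stand-alone definition of foo still has to handle a run-time x, so this only helps calls made from inside the library; it does not make an "i" constraint on a parameter satisfiable in the out-of-line copy.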
Is there have an efficient way to avoid using 256 if-else statement?
Not sure what you're doing with your vector exactly, but if you don't have a shuffle that can work with runtime-variable lane indices, storing to a 16-byte array can be the least bad option. But storing one byte and then reloading the whole vector will cause a store-forwarding stall, probably similar to the cost on x86 if not worse.
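The store-to-array idea could look roughly like the sketch below. It is written with a GNU C vector type and memcpy so it builds with GCC or Clang on any target; u8x16 and copy_lane_via_memory are hypothetical names standing in for uint8x16_t and the function from the question, and reducing the lane numbers modulo 16 is an assumption about the desired behaviour.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for uint8x16_t: a 16-byte GNU C vector type. */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* Copy byte lane d of c into byte lane b of a by going through memory.
   Works for run-time b and d, at the cost of a store and reload. */
static inline u8x16 copy_lane_via_memory(u8x16 a, unsigned b,
                                         u8x16 c, unsigned d)
{
    uint8_t dst[16], src[16];

    memcpy(dst, &a, sizeof dst);   /* spill both vectors to arrays */
    memcpy(src, &c, sizeof src);

    dst[b % 16] = src[d % 16];     /* one-byte store at a run-time index */

    memcpy(&a, dst, sizeof dst);   /* reload the whole vector */
    return a;
}
```

The one-byte store followed by the 16-byte reload is the pattern that can trigger the store-forwarding stall mentioned above, so this is a fallback and not a fast path.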
Doing your algorithm efficiently with AArch64 SIMD instructions is a separate question, and you haven't given enough info to figure out anything about that. Ask a different question if you want help implementing some algorithm to avoid this in the first place, or an efficient runtime-variable byte insert using other shuffles.
Upvotes: 2