Reputation: 1307
I'm writing code that generates highly optimized machine code targeting Haswell (so it has AVX2 instructions) and I'm trying to figure out the most efficient way to add together a pre-determined number of quadwords and doublewords. So, for example, I might have a structure like this:
0-8: QWORD a
8-16: QWORD b
16-20: DWORD c
20-28: QWORD d
28-36: QWORD e
36-40: DWORD f
40-48: QWORD g
48-56: QWORD h
56-64: QWORD i
I'd like to add this to another structure with the same layout, such that a(final) = a(first) + a(second), b(final) = b(first) + b(second), etc. I've been looking at the VPADDD and VPADDQ instructions, but obviously neither works in all cases: VPADDD drops the carry when a QWORD sum exceeds (2^32)-1, and VPADDQ fails when a QWORD does not sit on an 8-byte element boundary within the vector. I'm alright with overflows resulting in bad data being generated. I would put the cost of a mispredicted branch at a solid 15 cycles. It is acceptable to optimize this for numbers that are typically not larger than 2^31. Ideas?
Upvotes: 3
Views: 357
Reputation: 93107
Load the structure into ymm registers. Permute the dwords such that each dword field is zero-extended into a qword and each qword field sits at a qword boundary. Then do a qword add. Finally, undo the permutation to restore the structure's layout, discarding the high 32 bits of each dword field.
For example, for your structure, you could do the following series of operations:
Load the 64-byte structure from offset 0 into ymm registers, 256 bits at a time. Together they should contain the following dwords:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
al ah bl bh cx dl dh el eh fx gl gh hl hh il ih
Now permute with vpermilps such that the dwords end up as follows:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
al ah bl bh cx xx dl dh fx xx gl gh hl hh il ih
After that, you can apply a mask such that the xx entries are zeroes. Or you can ignore them as their values don't really matter.
Note that el and eh have vanished from the structure; e has to be added manually in a separate step. We eliminate el and eh rather than, say, il and ih because vpermilps cannot move data across 128-bit lanes. Note how the two dwords (c and f) have been zero-extended to 64 bits. You can now add two registers in this permuted form and then apply the inverse permutation to pack the fields back as they were before.
If you can change the order of fields, this is much easier: just rearrange them such that all the qwords come first, followed by all the dwords. Then you can add all qwords in one step and all dwords in another, without any shuffling.
Upvotes: 4
Reputation: 64903
For the case where you can do some rearranging to ensure that each qword is completely contained in the 32-byte-aligned block it starts in (so it lands in a single register instead of spanning two), you could add dwords and then resolve carries only where you want them. Roughly (not tested):
vmovdqa ymm0, [x]
vpaddd ymm1, ymm0, [y]          ; dword-wise sums, carries dropped
vpmaxud ymm2, ymm1, ymm0        ; max(x + y, x)
vpcmpeqd ymm2, ymm2, ymm1       ; all-ones in lanes with no carry-out
vpandn ymm2, ymm2, [carrymask]  ; invert; carrymask holds 1 in lanes whose carry should propagate
vmovdqa ymm3, [lshift32]        ; permutation indices shifting dwords up by one
vpermd ymm2, ymm3, ymm2         ; move each carry into the next dword
vpaddd ymm1, ymm1, ymm2         ; splice carries into the high dwords
The carry calculation is based on carry = (x + y) < x, but because there is no unsigned comparison, it's rewritten as carry = max(x + y, x) != x + y.
The slow vpermd makes me sad; lanes try to ruin everything as usual, so the old vpslldq doesn't really work here, unless you can rearrange the qwords so that they also don't cross 16-byte boundaries.
You can save the carrymask and lshift32 in registers of course, so those loads don't really count.
Upvotes: 1
Reputation: 11916
"The algorithm that I'm writing is often going to be limited by caches other than the L1"

fuz's answer is probably the fastest but, being limited by the L2 cache, some latency could be hidden behind it: for example, add the DWORD fields after the QWORDs, because while the SIMD ALUs are busy computing the QWORD sums, those two DWORDs can be loaded from L2 to be added in the scalar unit.

Maybe you should benchmark: compute the QWORDs while instruction-level parallelism loads the DWORDs from L2. The loaded DWORDs could then be added in a scalar ALU to gain even more parallelism.

But again, fuz's answer is probably the ultimate solution, since everything is done in CPU registers once and for all.
Upvotes: 2