Rauli Kumpulainen

Reputation: 336

Integrate ARMv8 Crypto accelerated SHA256 into existing C implementation function

I am trying to use an ARMv8/AArch64 assembly implementation of the SHA256 transform ("data block ordering") that is hardware-accelerated through the CPU Crypto extensions.

The problem is that I do not understand the difference between the existing void sha256_transform(uint32_t *state, const uint32_t *block, int swap) (1) https://github.com/fireworm71/veriumMiner/blob/main/algo/sha2.c#LC81 and the new ARMv8 SHA256 function void sha256_block_data_order(uint32_t *ctx, const void *in, size_t num) (2) https://github.com/glukolog/m-cpuminer-v2.1-armv8/blob/master/miner.h#L135. I was trying to integrate the latter with a preprocessor conditional: keep the original code when not building for aarch64, and otherwise redirect the work to the hardware-accelerated sha256_block_data_order, leaving the original sha256_transform as a mere wrapper so that the numerous calls to it from other functions need little, if any, editing.

I would like to draw attention to the int swap parameter of sha256_transform. The original C implementation uses it as a condition for doing something to the block data, whereas the parameter in the same position of sha256_block_data_order seems to be the "number of blocks" and is unrelated. In my most recent attempt I left the "int swap" code in before calling sha256_block_data_order, so that the data would be manipulated as originally intended, and I also tried casting in the function call, thinking that was the culprit. Below is the code I wrote recently for use inside void sha256_transform(uint32_t *state, const uint32_t *block, int swap):

#if defined(__aarch64__)

    uint32_t W[64];
    uint32_t S[8];
    int i;

    // 1. Prepare message schedule W.
    if (swap) {
        for (i = 0; i < 16; i++)
            W[i] = swab32(block[i]);
    } else
        memcpy(W, block, 64);
    for (i = 16; i < 64; i += 2) {
        W[i]   = s1(W[i - 2]) + W[i - 7] + s0(W[i - 15]) + W[i - 16];
        W[i+1] = s1(W[i - 1]) + W[i - 6] + s0(W[i - 14]) + W[i - 15];
    }

    sha256_block_data_order(state, (const unsigned char *) W, 1);

    for (i = 0; i < 8; i++)
        state[i] += S[i];
#else

As for the last loop, my guess is that its equivalent exists in another application where the ARMv8 sha256_block_data_order is already in use, at least in that application's C implementation, but I do not understand assembly well enough to see whether the same is being done there. uint32_t S is used in the C implementation of sha256_transform, but I cannot tell whether it is altered in any way before the final loop in my code above.

Link (2) also contains a declaration of sha256_transform. It is not applicable there, but since the two applications share the same implementation of it, you can compare its inputs with those of sha256_block_data_order, which is declared in the same file, to see the difference.

In my various attempts to make this work, I have been able to get it to compile without errors and to run the cpuminer binary in benchmarking mode, but when I have it do real work for a server, all results are rejected. Also, I only see single-digit differences in performance (measured in hash rate). Would this suggest that even if I could get it to work correctly, I am not going to see any worthwhile performance gain?

I have spent days on this and, not being very experienced or skilled, have resorted to asking here. Any feedback or advice is appreciated.

Upvotes: 2

Views: 1005

Answers (1)

Rauli Kumpulainen

Reputation: 336

I eventually managed to figure it out; see my GitHub fork if interested. The result is an approximately 1% speed boost to the hashing application, on account of a smaller, inlinable sha256 function along with a better memcpy and dual-issue optimization.

https://github.com/rollmeister/veriumMiner
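
For readers who only want the general shape of such a wrapper, here is a minimal sketch. It is not taken from the fork above. It assumes that sha256_block_data_order follows the OpenSSL-style contract (it reads each 64-byte block as big-endian bytes, expands the message schedule itself, and adds the result into the state) and that the host is little-endian, as is typical for aarch64 Linux. Under those assumptions the caller neither builds W nor adds S afterwards; it only has to get the byte order right. The names sha256_blocks_ref, SHA256_BLOCKS and HAVE_ARMV8_SHA256_ASM are hypothetical: the first is a portable stand-in so the sketch can be tried without the assembly file, and the last is an illustrative build flag.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

static const uint32_t K[64] = {
    0x428a2f98,0x71374491,0xb5c0fbcf,0xe9b5dba5,0x3956c25b,0x59f111f1,0x923f82a4,0xab1c5ed5,
    0xd807aa98,0x12835b01,0x243185be,0x550c7dc3,0x72be5d74,0x80deb1fe,0x9bdc06a7,0xc19bf174,
    0xe49b69c1,0xefbe4786,0x0fc19dc6,0x240ca1cc,0x2de92c6f,0x4a7484aa,0x5cb0a9dc,0x76f988da,
    0x983e5152,0xa831c66d,0xb00327c8,0xbf597fc7,0xc6e00bf3,0xd5a79147,0x06ca6351,0x14292967,
    0x27b70a85,0x2e1b2138,0x4d2c6dfc,0x53380d13,0x650a7354,0x766a0abb,0x81c2c92e,0x92722c85,
    0xa2bfe8a1,0xa81a664b,0xc24b8b70,0xc76c51a3,0xd192e819,0xd6990624,0xf40e3585,0x106aa070,
    0x19a4c116,0x1e376c08,0x2748774c,0x34b0bcb5,0x391c0cb3,0x4ed8aa4a,0x5b9cca4f,0x682e6ff3,
    0x748f82ee,0x78a5636f,0x84c87814,0x8cc70208,0x90befffa,0xa4506ceb,0xbef9a3f7,0xc67178f2
};

/* Hypothetical portable stand-in with the ASSUMED contract of the assembly
 * routine: `in` is `num` 64-byte blocks of big-endian bytes, `ctx` is the
 * 8-word state, updated in place (the final "state += S" happens in here). */
static void sha256_blocks_ref(uint32_t *ctx, const void *in, size_t num)
{
    const unsigned char *p = in;
    while (num--) {
        uint32_t W[64], s[8];
        int i;
        for (i = 0; i < 16; i++, p += 4)   /* big-endian load */
            W[i] = (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
                   (uint32_t)p[2] << 8  | (uint32_t)p[3];
        for (i = 16; i < 64; i++) {        /* message schedule */
            uint32_t s0 = ROTR(W[i-15], 7) ^ ROTR(W[i-15], 18) ^ (W[i-15] >> 3);
            uint32_t s1 = ROTR(W[i-2], 17) ^ ROTR(W[i-2], 19) ^ (W[i-2] >> 10);
            W[i] = W[i-16] + s0 + W[i-7] + s1;
        }
        memcpy(s, ctx, sizeof s);
        for (i = 0; i < 64; i++) {         /* 64 compression rounds */
            uint32_t S1 = ROTR(s[4], 6) ^ ROTR(s[4], 11) ^ ROTR(s[4], 25);
            uint32_t ch = (s[4] & s[5]) ^ (~s[4] & s[6]);
            uint32_t t1 = s[7] + S1 + ch + K[i] + W[i];
            uint32_t S0 = ROTR(s[0], 2) ^ ROTR(s[0], 13) ^ ROTR(s[0], 22);
            uint32_t mj = (s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]);
            s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + t1;
            s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = t1 + S0 + mj;
        }
        for (i = 0; i < 8; i++)
            ctx[i] += s[i];
    }
}

/* Pick the block function at compile time. HAVE_ARMV8_SHA256_ASM is a
 * hypothetical flag; define it only when the assembly file is linked in. */
#if defined(__aarch64__) && defined(HAVE_ARMV8_SHA256_ASM)
void sha256_block_data_order(uint32_t *ctx, const void *in, size_t num);
#define SHA256_BLOCKS sha256_block_data_order
#else
#define SHA256_BLOCKS sha256_blocks_ref
#endif

static uint32_t swab32(uint32_t x)
{
    return (x << 24) | ((x << 8) & 0x00ff0000u) |
           ((x >> 8) & 0x0000ff00u) | (x >> 24);
}

/* Wrapper keeping the original signature (little-endian host assumed). */
void sha256_transform(uint32_t *state, const uint32_t *block, int swap)
{
    if (swap) {
        /* Memory already holds big-endian message bytes: pass through. */
        SHA256_BLOCKS(state, block, 1);
    } else {
        /* Words are in host order: swap so memory becomes big-endian. */
        uint32_t tmp[16];
        int i;
        for (i = 0; i < 16; i++)
            tmp[i] = swab32(block[i]);
        SHA256_BLOCKS(state, tmp, 1);
    }
}
```

A quick sanity check is to feed the padded single block for "abc" through the wrapper with the standard initial state, once as host-order words with swap = 0 and once as raw bytes with swap = 1. Both calls should leave the same digest in the state, starting with 0xba7816bf. If the real assembly routine behaves differently from the assumption above, the byte-order handling in the wrapper is the part to revisit.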

Upvotes: 2
