Reputation: 3
I'm using 256-bit variables (__m256i type) in the new version of my program, which targets AVX2 through Intel intrinsics. Previously the data was processed in 64-bit chunks, so the _mm_crc32_u64 function was used for the CRC calculation:
crc = _mm_crc32_u64(seed,*chunk_64bit);
Now, to improve performance, I want to calculate a CRC for each 256-bit chunk (or at least each 128-bit chunk) separately. One way would be to apply _mm_crc32_u64 in a loop over the 64-bit values in each chunk, but I don't think that gains anything in performance.
What is the best method for calculating a CRC over a 256-bit (or 128-bit) chunk that is faster overall than repeated _mm_crc32_u64 operations?
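For reference, a minimal sketch of that loop approach. The helper name `crc32c_chunk256` is hypothetical, the `__attribute__((target("sse4.2")))` annotation assumes GCC or Clang on x86-64, and the chunk is passed by address (a pointer to an `__m256i` works, since it is 32 bytes in memory). Each `_mm_crc32_u64` needs the result of the previous one, so the four calls form a single serial dependency chain:

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: CRC-32C of one 256-bit (32-byte) chunk.
   Assumes GCC/Clang on x86-64; the target attribute avoids needing -msse4.2. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_chunk256(uint32_t seed, const void *chunk)
{
    uint64_t lanes[4];
    memcpy(lanes, chunk, sizeof lanes);      /* four 64-bit lanes */

    uint64_t crc = seed;
    for (int i = 0; i < 4; i++)
        crc = _mm_crc32_u64(crc, lanes[i]);  /* each step waits on the last */
    return (uint32_t)crc;
}
```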
Upvotes: 0
Views: 677
Reputation: 112189
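You can interleave three crc32 instructions for higher performance: on Intel processors the instruction has a latency of three cycles but a throughput of one per cycle, so three independent CRC calculations can be in flight at once. See this answer for code that does that. You can take it a step further by running that code on multiple processors and combining the resulting CRCs.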
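Since the question asks for a separate CRC per chunk, the interleaving can be sketched without any CRC-combining step: three chunks are processed side by side, each with its own accumulator. This is only an illustration under assumptions, not the code from the linked answer. The names `load_u64` and `crc32c_per_chunk256` are hypothetical, the chunks are assumed to be contiguous in memory, and the `target` attribute assumes GCC or Clang on x86-64.

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u64 */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Load one 64-bit word without alignment or aliasing problems. */
static inline uint64_t load_u64(const uint8_t *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof w);
    return w;
}

/* Hypothetical helper: writes one CRC-32C per 32-byte chunk into out[]. */
__attribute__((target("sse4.2")))
static void crc32c_per_chunk256(const uint8_t *data, size_t nchunks,
                                uint32_t seed, uint32_t *out)
{
    size_t i = 0;

    /* Three chunks at a time: the three chains do not depend on each
       other, so their crc32 instructions can overlap in the pipeline. */
    for (; i + 3 <= nchunks; i += 3) {
        const uint8_t *a = data + 32 * i;
        const uint8_t *b = a + 32;
        const uint8_t *c = b + 32;
        uint64_t ca = seed, cb = seed, cc = seed;
        for (int k = 0; k < 32; k += 8) {
            ca = _mm_crc32_u64(ca, load_u64(a + k));
            cb = _mm_crc32_u64(cb, load_u64(b + k));
            cc = _mm_crc32_u64(cc, load_u64(c + k));
        }
        out[i]     = (uint32_t)ca;
        out[i + 1] = (uint32_t)cb;
        out[i + 2] = (uint32_t)cc;
    }

    /* Remaining one or two chunks, processed serially. */
    for (; i < nchunks; i++) {
        uint64_t crc = seed;
        for (int k = 0; k < 32; k += 8)
            crc = _mm_crc32_u64(crc, load_u64(data + 32 * i + k));
        out[i] = (uint32_t)crc;
    }
}
```

Whether this actually beats the serial loop depends on the CPU and on how the compiler schedules the instructions, so it is worth benchmarking on the target machine.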
Upvotes: 1