Reputation: 8202
I'm working with some piece of hardware (the hardware itself is not important) and I need to split some block data into separate pieces in order to make the thing run faster.
So I have, for example, a contiguous block of memory X words long. For visualization, I'm arranging it into 50-word lines below:
001 002 003 004 005 006 007 ...
051 052 053 054 055 056 057 ...
101 102 103 104 105 106 107 ...
151 152 153 154 155 156 157 ...
I need a fast way of splitting these into four separate blocks:
Block1
001 003 005 007 ...
101 103 105 107 ...
Block2
002 004 006 ...
102 104 106 ...
Block3
051 053 055 057 ...
151 153 155 157 ...
Block4
052 054 056 ...
152 154 156 ...
Or, basically:
Block1 Block2 Block1 Block2 ...
Block3 Block4 Block3 Block4 ...
Block1 Block2 Block1 Block2 ...
Block3 Block4 Block3 Block4 ...
Now doing this is as simple as using for-loops. But what is a more optimized/parallel way of doing this? (No MPI stuff; this happens in an app running on the desktop.)
So summing it up, just to be clear:
I have data as shown above.
I'm sending this data to several devices (outside the PC). This data needs to be sent down the wire as 4 separate blocks (to the separate devices).
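For reference, the straightforward for-loop version I mean is roughly the following (a sketch only, assuming 32-bit words, a row-major source buffer, and pre-allocated destination buffers; all names are illustrative):

/* Naive split: even rows feed Block1/Block2, odd rows feed Block3/Block4,
 * with even columns going to the first block of each pair. */
void split_blocks(const int *src, int rows, int cols,
                  int *b1, int *b2, int *b3, int *b4)
{
    int n1 = 0, n2 = 0, n3 = 0, n4 = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
        {
            int w = src[r * cols + c];
            if (r % 2 == 0) {
                if (c % 2 == 0) b1[n1++] = w; else b2[n2++] = w;
            } else {
                if (c % 2 == 0) b3[n3++] = w; else b4[n4++] = w;
            }
        }
}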
Upvotes: 0
Views: 279
Reputation: 179819
This is a prime example of where SSE can help you. It's very good at data shuffling, as well as streaming data from memory and back. On some non-x86 architectures, similar ISA extensions are available (e.g. AltiVec).
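For example, one row can be split into its even- and odd-column halves with a couple of SSE2 shuffles and streaming stores. A minimal sketch (not from the original answer), assuming 32-bit words, a row length that is a multiple of 8, and 16-byte-aligned destination buffers; the function name is illustrative:

#include <emmintrin.h>   /* SSE2 intrinsics */

static void split_row_sse2(const int *src, int *even_dst, int *odd_dst, int n)
{
    /* Each 8-word chunk yields 4 even-column and 4 odd-column words,
     * hence the i/2 destination index. */
    for (int i = 0; i + 8 <= n; i += 8)
    {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));     /* a0 a1 a2 a3 */
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 4)); /* a4 a5 a6 a7 */

        /* Reorder each register so the even-indexed words come first. */
        __m128i sa = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0));  /* a0 a2 a1 a3 */
        __m128i sb = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0));  /* a4 a6 a5 a7 */

        /* Combine the even halves and the odd halves. */
        __m128i evens = _mm_unpacklo_epi64(sa, sb);                  /* a0 a2 a4 a6 */
        __m128i odds  = _mm_unpackhi_epi64(sa, sb);                  /* a1 a3 a5 a7 */

        /* Streaming stores bypass the cache, which helps when the blocks
         * are written once and then shipped off to a device. */
        _mm_stream_si128((__m128i *)(even_dst + i / 2), evens);
        _mm_stream_si128((__m128i *)(odd_dst  + i / 2), odds);
    }
    _mm_sfence(); /* make the streaming stores globally visible */
}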
Upvotes: 1
Reputation: 156158
EDIT: It sounds like you're passing the data to an external interface. If this is anything as slow as a gigabit ethernet interface, then the bottleneck will be at the wire and not how quickly you can compose data. Just iterate over the data to build your blocks in any manner that seems convenient for your code.
Perhaps what you want to do is pass the blocks around using an offset/stride notation. Essentially, each block is described by its starting address, the offset into the block where its first element appears, the number of words between consecutive elements, and the number of words between rows. So, something like:
Block     1    2    3    4
base      0    0   50   50
first     0    1    0    1
offset    2    2    2    2
stride  100  100  100  100
So you could work on the data in parallel (assuming you don't have to worry about writes) with something like this:
struct Block {
    int base;    /* word index where the block's first row starts       */
    int first;   /* offset of the block's first element within that row */
    int offset;  /* words between consecutive elements in a row         */
    int stride;  /* words between consecutive rows of the block         */
    int cols;
    int rows;
};

/* given some reasonable block[n] and buffer */
for (int row = 0; row < block[n].rows; ++row)
    for (int col = 0; col < block[n].cols; ++col)
    {
        int cell = buffer[block[n].base +
                          block[n].first +
                          row * block[n].stride +
                          col * block[n].offset];
        doSomething(cell);
    }
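One way to spread that work across cores is to give each of the four blocks its own thread, e.g. with an OpenMP loop (an assumption; the answer doesn't name a threading mechanism). The sketch below gathers each block into its own contiguous output buffer, ready to be sent to its device; the threads only read the shared buffer and write disjoint outputs, so no locking is needed. Compile with OpenMP enabled (e.g. -fopenmp for GCC/Clang).

/* Gather all four blocks in parallel, using the struct Block above.
 * out[n] must be large enough for block[n].rows * block[n].cols words. */
void gather_blocks(const int *buffer, const struct Block block[4], int *out[4])
{
    #pragma omp parallel for
    for (int n = 0; n < 4; ++n)
    {
        int k = 0;
        for (int row = 0; row < block[n].rows; ++row)
            for (int col = 0; col < block[n].cols; ++col)
                out[n][k++] = buffer[block[n].base + block[n].first +
                                     row * block[n].stride +
                                     col * block[n].offset];
    }
}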
Upvotes: 0