Reputation: 1195
I'm looking for a fast way to do a partial sort of 81 numbers - Ideally I'm looking to extract the lowest 16 values (its not necessary for the 16 to be in the absolutely correct order).
The target for this is dedicated hardware in an FPGA - so this slightly complicated matters as I want the area of the resultant implementation as small as possible. I looked at and implemented the odd-even merge sort algorithm, but I'm ideally looking for anything that might be more efficient for my needs (trade algorithm implementation size for a partial sort giving lowest 16, not necessarily in order as opposed to a full sort).
Upvotes: 8
Views: 4798
Reputation: 2497
Musing selection of k values of n using combinatoric circuits recently, I'm thinking selection network (of CEs: compare & exchange elements).
As the size of practical such networks grows with nlog2n,
one should compare different partitionings:
sorters | sorter CEs | layers | merge CEs | layers | total CEs | layers | spanned |
---|---|---|---|---|---|---|---|
40+41 | 216+230 | 17 | 16 | 1 | 462 | 18 | 3557 |
3*27 | 414 | 14 | 48+16 | 5 + 1 | 478 | 20 | 2905 |
3*20+21 | 273+97 | 12 | 2*48+16 | 5 + 1 | 482 | 18 | |
4*16+17 | 120+122+71 | 9 | 3*48+16 | 2*5 + 1 | 473 | 20 |
(Above "Knuth diagram" shows 81-16 = 65 element where the min/lower output is dispensable - every top 16-of-81 network does.)
The cost of those alternatives is closer than I expected.
The best network from above from a known good 40-sorter, a known fast 41-sorter and a merger not preserving order:
Restituting order requires 32 more CEs in 4 more layers.
Upvotes: 0
Reputation: 2993
If you want to avoid state machines and loop-indexing (with associated large muxes or RAMs) then a hardware friendly approach might be to use a sorting network for 17 items (yuck!). Feed this with 16 items from registers (initially empty) and 1 item from the input bus. For each valid cycle on the input bus, the output of the sorting network is the smallest 16 seen so far with the new element either within the results or at the top (if it is larger). The smallest 16 then latch into the registers for the next cycle and the top output from the sorting network is ignored.
This will require 16*(n-bits per element) registers, and a sorting network that is 5 subtract/compare blocks deep and 8 wide, for 40 subtract/complete blocks total (each with a worst case propagation of n-bits). Can be pipelined between stages trivially if you can back pressure the input bus. The solution is almost constant in time and space: 1 cycle per input, and space is only a function of the number of outputs not inputs.
Upvotes: 0
Reputation:
This sounds like a signal processing kernel of some sort. It's hard to help answer this without knowing the exact data flow in your design. Any algorithm involving a sort has a address decoding cost since you'll need to be able to write and read a memory of 81-elements. If this data is in a memory this cost has already been paid but if it is in distinct registers then writing to them carries an area cost.
Assuming the data is in a memory, you could use bubble sort and take bottom 16 values. The below code assumes a two-port memory but it could work with a single port by alternating reads and writes on every clock cycle using a temporary register to store the write value and write index. This may not be more area efficient with only 81 elements in the memory. Alternatively the source memory could be implemented as two single-port memories with one having odd indices and the other even indices.
// Not working code
reg [15:0] inData [80:0]; // Must be two-port
reg [3:0] iterCount = 0;
reg [6:0] index = 0;
reg sorting;
always @(posedge clk)
begin
// Some condition to control execution
if(sorting)
begin
if(index == 80)
begin
// Stop after 16 iterations
if(iterCount == 15)
begin
sorting <= 0;
iterCount <= 0;
end
else
begin
iterCount <= iterCount+1;
index <= 0;
end
end
else
begin
// If this is smaller than next item
if(inData[index] < inData[index+1])
begin
// Swap with next item
inData[index+1] <= inData[index];
inData[index] <= inData[index+1];
end
index <= index + 1;
end
end
end
EDIT: If you're latency constrained, allowed only one clock domain and must pipeline then the problem is limited to selecting a sorting network and mapping it to a pipeline. You can't use resource sharing so area is fixed given your sorting network.
Upvotes: 2
Reputation: 10789
This can be done in O(nlog_k)
time, where in your case n = 81 and k = 16. This is much efficiently when you dealing with big number of elements than O(nlog_n)
.
Algorithm is following:
n
with max from heap (if it is greater than max just skip, if less than max, remove max
and add n
.)This way at every iteration you keep tracking smallest k numbers from currently processed part of list.
Upvotes: 8
Reputation: 18042
First set an array with 16 values. And use something like selection sort:
Upvotes: 0
Reputation: 2017
If you're looking for general algorithms, then you could use a version of Quicksort. Quicksort sorts values around a pivot element. You choose a pivot and sort your array. You then get x values < pivot and (n-x-1) larger ones. Depending on x, you choose either one array to continue processing. If x>16, then you know that the numbers your looking for are all in the x-array and you can continue sorting that. If not, then you know x lowest and can now look for the 16-x others in the other array recursively.
The resulting arrays of quicksort are not fully sorted, you only know that they are smaller or larger than your pivot. Some info at wikipedia and a paper.
Upvotes: 1