Reputation: 1195

Fast, small area and low latency partial sorting algorithm

I'm looking for a fast way to do a partial sort of 81 numbers - Ideally I'm looking to extract the lowest 16 values (its not necessary for the 16 to be in the absolutely correct order).

The target for this is dedicated hardware in an FPGA - so this slightly complicated matters as I want the area of the resultant implementation as small as possible. I looked at and implemented the odd-even merge sort algorithm, but I'm ideally looking for anything that might be more efficient for my needs (trade algorithm implementation size for a partial sort giving lowest 16, not necessarily in order as opposed to a full sort).

Upvotes: 8

Answers (6)

greybeard

Reputation: 2497

Musing selection of k values of n using combinatoric circuits recently, I'm thinking selection network (of CEs: compare & exchange elements).
As the size of practical such networks grows with nlog₂n, one should compare different partitionings:

sorters	sorter CEs	layers	merge CEs	layers	total CEs	layers	spanned
40+41	216+230	17	16	1	462	18	3557
3*27	414	14	48+16	5 + 1	478	20	2905
3*20+21	273+97	12	2*48+16	5 + 1	482	18
4*16+17	120+122+71	9	3*48+16	2*5 + 1	473	20

(Above "Knuth diagram" shows 81-16 = 65 element where the min/lower output is dispensable - every top 16-of-81 network does.)
The cost of those alternatives is closer than I expected.
The best network from above from a known good 40-sorter, a known fast 41-sorter and a merger not preserving order:

Restituting order requires 32 more CEs in 4 more layers.

Upvotes: 0

shuckc

Reputation: 2993

If you want to avoid state machines and loop-indexing (with associated large muxes or RAMs) then a hardware friendly approach might be to use a sorting network for 17 items (yuck!). Feed this with 16 items from registers (initially empty) and 1 item from the input bus. For each valid cycle on the input bus, the output of the sorting network is the smallest 16 seen so far with the new element either within the results or at the top (if it is larger). The smallest 16 then latch into the registers for the next cycle and the top output from the sorting network is ignored.

This will require 16*(n-bits per element) registers, and a sorting network that is 5 subtract/compare blocks deep and 8 wide, for 40 subtract/complete blocks total (each with a worst case propagation of n-bits). Can be pipelined between stages trivially if you can back pressure the input bus. The solution is almost constant in time and space: 1 cycle per input, and space is only a function of the number of outputs not inputs.

Upvotes: 0

user597225

Reputation:

This sounds like a signal processing kernel of some sort. It's hard to help answer this without knowing the exact data flow in your design. Any algorithm involving a sort has a address decoding cost since you'll need to be able to write and read a memory of 81-elements. If this data is in a memory this cost has already been paid but if it is in distinct registers then writing to them carries an area cost.

Assuming the data is in a memory, you could use bubble sort and take bottom 16 values. The below code assumes a two-port memory but it could work with a single port by alternating reads and writes on every clock cycle using a temporary register to store the write value and write index. This may not be more area efficient with only 81 elements in the memory. Alternatively the source memory could be implemented as two single-port memories with one having odd indices and the other even indices.

// Not working code 
reg [15:0] inData [80:0]; // Must be two-port
reg [3:0]  iterCount = 0;
reg [6:0]  index = 0;
reg sorting;

always @(posedge clk)
  begin
  // Some condition to control execution
  if(sorting)
    begin

    if(index == 80)
      begin 

      // Stop after 16 iterations
      if(iterCount == 15)
        begin
        sorting <= 0;
        iterCount <= 0;
        end
      else
        begin
        iterCount <= iterCount+1;
        index <= 0;
        end
      end 
    else
      begin
      // If this is smaller than next item
      if(inData[index] < inData[index+1])
        begin
        // Swap with next item
        inData[index+1] <= inData[index];
        inData[index]   <= inData[index+1];
        end
      index <= index + 1;
      end
    end
  end

EDIT: If you're latency constrained, allowed only one clock domain and must pipeline then the problem is limited to selecting a sorting network and mapping it to a pipeline. You can't use resource sharing so area is fixed given your sorting network.

Select a sorting network(Bitonic, Pairwise,etc) for N=81 (Not easy)
Build a directed data flow graph representing the sorting network
Remove all nodes except those required for outputs 66-81
Determine the latency L of one comparison node
Use ALAP scheduling to assign M nodes per clock where M*L < 1/f
If scheduling succeeds code it up in an HDL

Upvotes: 2

mishadoff

Reputation: 10789

This can be done in O(nlog_k) time, where in your case n = 81 and k = 16. This is much efficiently when you dealing with big number of elements than O(nlog_n).

Algorithm is following:

Init max heap data structure. This heap will contain your top 16. Size of this heap will not exceed 16 and it has log(k) for insertion and deletion
Iterate through list of n - numbers. For each n:
- if heap has less than 16 elements, just add it
- if heap has 16 elements, compare n with max from heap (if it is greater than max just skip, if less than max, remove max and add n.)

This way at every iteration you keep tracking smallest k numbers from currently processed part of list.

Upvotes: 8

Alberto Bonsanto

Reputation: 18042

First set an array with 16 values. And use something like selection sort:

You think the first value is the minimal in the array.
Compare the value of the first element, second, until you find a number lower than it.
Take that as your new minimal, and compare it with the next numbers until you find a number lower than it.
If you finished the array, the number you have is your minimal, save it in an array and exclude it from the source array, and start over again until you fill the 16 positions of the solution array.

Upvotes: 0

Origin

Reputation: 2017

If you're looking for general algorithms, then you could use a version of Quicksort. Quicksort sorts values around a pivot element. You choose a pivot and sort your array. You then get x values < pivot and (n-x-1) larger ones. Depending on x, you choose either one array to continue processing. If x>16, then you know that the numbers your looking for are all in the x-array and you can continue sorting that. If not, then you know x lowest and can now look for the 16-x others in the other array recursively.

The resulting arrays of quicksort are not fully sorted, you only know that they are smaller or larger than your pivot. Some info at wikipedia and a paper.

Upvotes: 1

Fast, small area and low latency partial sorting algorithm

Answers (6)

Related Questions