Reputation: 172
I've been trying to create a counting sort algorithm using Verilog HDL, but when I tried to compile this iteration of it, Quartus started to compile it for a really long time. I can't figure out what is the issue.
module sort(reset, clk, data_in0,data_in1,data_in2,data_in3,data_in4,data_in5,data_in6,data_in7,data_in8,data_in9, data_out0, data_out1, data_out2, data_out3, data_out4, data_out5, data_out6, data_out7, data_out8, data_out9);
input wire reset, clk;
input wire [1:0] data_in0;
input wire [1:0] data_in1;
input wire [1:0] data_in2;
input wire [1:0] data_in3;
input wire [1:0] data_in4;
input wire [1:0] data_in5;
input wire [1:0] data_in6;
input wire [1:0] data_in7;
input wire [1:0] data_in8;
input wire [1:0] data_in9;
output reg [1:0] data_out0;
output reg [1:0] data_out1;
output reg [1:0] data_out2;
output reg [1:0] data_out3;
output reg [1:0] data_out4;
output reg [1:0] data_out5;
output reg [1:0] data_out6;
output reg [1:0] data_out7;
output reg [1:0] data_out8;
output reg [1:0] data_out9;
reg [1:0] mem [9:0];
reg[9:0] buff [3:0];
integer i,k,j,f,s;
always@ (posedge clk)
begin
for(i=0; i<4; i=i+1)
buff[i]<=0;
if (reset == 1) begin
for (i = 0; i < 10; i = i + 1) mem[i]<=0;
s=0;
f=0;
end
else begin
if (f==0)begin
mem [0] <= data_in0;
mem[1]<=data_in1;
mem[2]<=data_in2;
mem[3]<=data_in3;
mem[4]<=data_in4;
mem[5]<=data_in5;
mem[6]<=data_in6;
mem[7]<=data_in7;
mem[8]<=data_in8;
mem[9]<=data_in9;
f=1;
end
for( i = 0; i <10 ; i=i+1)
begin
buff[mem[i]]<=buff[mem[i]]+1;
end
if(s==0) begin
k<=0;
for( i = 0; i <4 ; i=i+1)
begin
for( j = 0; j < 10 ; j = j +1)
begin
if(j<buff[i])
begin
mem[k]<=i;
k<=k+1;
end
end
end
end s=1;
data_out0 = mem[0];
data_out1 = mem[1];
data_out2 = mem[2];
data_out3 = mem[3];
data_out4 = mem[4];
data_out5 = mem[5];
data_out6 = mem[6];
data_out7 = mem[7];
data_out8 = mem[8];
data_out9 = mem[9];
end
end
endmodule
It takes a really long time to pass the Analysis and Synthesis section. I assume it is due to mistakes in this code or the wrong use of operators, but I can't understand where it is exactly.
Upvotes: 0
Views: 641
Reputation: 11418
This is an example of how a FSM-based solution for your problem would be. It can be vastly improved, though, but it's just a starting (and hopefully working) point.
For start, I've changed your module interface. Discrete inputs can be used, but as the algorithm uses indexes to run over the entire input domain, I've assume two external memories: one with the input data and another one which will hold the output data. The module drives the corresponding address bus for both memories, as well as the write enable signal, and data buses. There is also a busy
signal so the rest of the system knows that the module has not yet finished sorting data. Finally, I've sorted 16 numbers instead of 10.
Internally, I've used a memory element, count
, as the vector that holds the histogram of the input data. As this memory is tiny, I've used as four independent registers. This allows me to use more than one element of "count" in the same clock cycle, as in count[3] <= count[3] + count[2] + count[1] + count[0];
The version of the algorithm I've used is from Wikipedia: https://en.wikipedia.org/wiki/Counting_sort
function countingSort(array, k) is
count ← new array of k zeros
for i = 1 to length(array) do
count[array[i]] ← count[array[i]] + 1
for i = 2 to k do
count[i] ← count[i] + count[i - 1]
for i = length(array) downto 1 do
output[count[array[i]]] ← array[i]
count[array[i]] ← count[array[i]] - 1
return output
And this is the Verilog module:
module sort (
input wire clk,
input wire reset,
output reg [3:0] addr_data_in,
input wire [1:0] data_in,
output reg [3:0] addr_data_out,
output reg [1:0] data_out,
output reg write_data_out_strobe,
output reg busy
);
/*
function countingSort(array, k) is
count ← new array of k zeros
for i = 1 to length(array) do
count[array[i]] ← count[array[i]] + 1
for i = 2 to k do
count[i] ← count[i] + count[i - 1]
for i = length(array) downto 1 do
output[count[array[i]]] ← array[i]
count[array[i]] ← count[array[i]] - 1
return output
*/
localparam
ZERO = 3'd0,
MAKEHIST1 = 3'd1,
MAKEHIST2 = 3'd2,
PREFIXSUM = 3'd3,
PLACEOUTPUT1 = 3'd4,
PLACEOUTPUT2 = 3'd5,
IDLE = 3'd7
;
reg [4:0] count[0:3];
reg [2:0] state = IDLE;
reg [1:0] data;
always @(posedge clk) begin
if (reset == 1'b1) begin
state <= ZERO;
write_data_out_strobe <= 1'b0;
busy <= 1'b1;
end
else begin
case (state)
ZERO:
//count ← new array of k zeros
begin
count[0] <= 4'd0;
count[1] <= 4'd0;
count[2] <= 4'd0;
count[3] <= 4'd0;
addr_data_in <= 4'd0;
state <= MAKEHIST1;
end
MAKEHIST1:
//for i = 1 to length(array) do
// count[array[i]] ← count[array[i]] + 1
begin
data <= data_in;
addr_data_in <= addr_data_in + 4'd1;
state <= MAKEHIST2;
end
MAKEHIST2:
begin
count[data] <= count[data] + 4'd1;
if (addr_data_in == 4'd0)
state <= PREFIXSUM;
else
state <= MAKEHIST1;
end
PREFIXSUM:
//for i = 2 to k do
// count[i] ← count[i] + count[i - 1]
begin
count[1] <= count[1] + count[0];
count[2] <= count[2] + count[1] + count[0];
count[3] <= count[3] + count[2] + count[1] + count[0];
addr_data_in <= 4'd15;
state <= PLACEOUTPUT1;
end
PLACEOUTPUT1:
//for i = length(array) downto 1 do
// output[count[array[i]]] ← array[i]
// count[array[i]] ← count[array[i]] - 1
begin
data <= data_in;
addr_data_in <= addr_data_in - 4'd1;
write_data_out_strobe <= 1'b0;
state <= PLACEOUTPUT2;
end
PLACEOUTPUT2:
begin
addr_data_out <= count[data] - 5'd1;
data_out <= data;
count[data] <= count[data] - 4'd1;
write_data_out_strobe <= 1'b1;
if (addr_data_in == 4'd15)
state <= IDLE;
else
state <= PLACEOUTPUT1;
end
IDLE:
begin
write_data_out_strobe <= 1'b0;
busy <= 1'b0;
end
endcase
end // of else
end // of always
endmodule
You can see that because of the way I'm using count, this will surely generate lots of muxes and decoders, just because I'm using a register value as the address for count[] in some places. However, I think this will synthesize much faster. Yosis can make it in a couple of seconds, FYI.
Besides, here you have a test bench for the above module:
module tb_counting_sort;
reg clk, reset;
wire [3:0] addr_data_in, addr_data_out;
wire [1:0] data_in,data_out;
wire write_data_out_strobe, busy;
sort uut (
.clk(clk),
.reset(reset),
.addr_data_in(addr_data_in),
.data_in(data_in),
.addr_data_out(addr_data_out),
.data_out(data_out),
.write_data_out_strobe(write_data_out_strobe),
.busy(busy)
);
reg [1:0] vector_in[0:15];
reg [1:0] vector_out[0:15];
assign data_in = vector_in[addr_data_in];
always @(posedge clk)
if (write_data_out_strobe == 1'b1)
vector_out[addr_data_out] <= data_out;
integer i;
initial begin
vector_in[0] = 2'd2;
vector_in[1] = 2'd1;
vector_in[2] = 2'd0;
vector_in[3] = 2'd0;
vector_in[4] = 2'd3;
vector_in[5] = 2'd1;
vector_in[6] = 2'd0;
vector_in[7] = 2'd2;
vector_in[8] = 2'd1;
vector_in[9] = 2'd1;
vector_in[10] = 2'd3;
vector_in[11] = 2'd3;
vector_in[12] = 2'd3;
vector_in[13] = 2'd2;
vector_in[14] = 2'd1;
vector_in[15] = 2'd0;
reset = 1'b1;
clk = 1'b0;
repeat (2)
@(posedge clk);
reset = 1'b0;
@(negedge busy);
for (i=0;i<16;i=i+1)
$write ("%1d ", vector_out[i]);
$display("");
$finish;
end
always begin
clk = #5 ~clk;
end
endmodule
Both modules can be viewed, simulated or synthesized at EDAPlayground, here: https://www.edaplayground.com/x/6GLj
Upvotes: 1
Reputation: 11418
for loops in Verilog don't work the way you seem to expect. This is not going to execute step by step, but the synthesis tool will try to unroll the loops, and as everything is contained within an always @(posedge clk)
, it will do all unrolled statements in a single clock cycle. Rethink your module using state machines to achieve sequentiality.
Upvotes: 2