Power and timing reports of two different VHDL designs

Question

Suppose I have two circuits (described in VHDL). The first one performs the following algorithm in a loop (pseudocode):

C<=A+B;
D<=C+F;
RES <= D;

I implement this algorithm as a finite state machine (FSM):

State1:C<=A+B;
       out_ready<='0';--the result is not ready yet
       nextstate<=State2;
State2:D<=C+F;
       nextstate<=S_out;
S_out: RES<=D;
       out_ready<='1';--the result is ready
       nextstate<=State1;

The second algorithm is also sequential and implemented as an FSM. Keep in mind that a CSA is a carry-save adder: it has 3 inputs (A, B, C) and immediately produces two results, S and C:

S <= '0'&(A xor B xor C);
C <= ((A and B) or (A and C) or (B and C)) & '0';

Adding S and C gives the final sum:

RES <= S+C;

The advantage is that, in some cases, you can work with only the S and C vectors (generated in one clock cycle) without needing to add them. Back to my second algorithm:

(S,C)=(A,B,Carry_in);
(S,C)=(S,C,F);
RES_S=S;
RES_C=C;

This is also implemented as an FSM:

State_CSA_1:
   S <= '0'&(A xor B xor 0);
   C <= ((A and B) or (A and 0) or (B and 0)) & '0';
   out_ready<='0';--the result is not ready yet
   nextstate<=State_CSA_2;
State_CSA_2:
   S <= '0'&(S xor C xor F);
   C <= ((S and C) or (S and F) or (C and F)) & '0';
   nextstate<=S_out;
S_out:
   RES_S<=S;
   RES_C<=C;
   out_ready<='1';--the result is ready
   nextstate<=State_CSA_1;

If I simulate both designs (I used ModelSim) with a testbench that toggles the clock every 0.5 ns, the result is generated after 3 clock cycles in both cases. But it is obvious that the second algorithm is much faster. Since I must write a report about the differences between the two circuits, I have the following questions:

1) I want to know the time needed to perform the two algorithms. If I do a timing analysis, for example with Xilinx ISE, will there be differences between the performance of the two circuits? Or will the time again be determined by the 3 clock cycles?

2) I must report the time, the power consumption and the area occupied. Which software do you recommend? Since I do not have much time, I need something easy to use or well documented (tutorials and so on).

PS: I invented the two algorithms while writing this post; I'm actually working on other, boring things.

Jonathan Drolet · Accepted Answer

Digital circuit performance is measured in terms of throughput, maximum operating frequency, latency, area and power.

Your 3 clock cycles are the latency, which is quite easy to deduce from your VHDL since your FSM takes 3 cycles from input to output. Maximum operating frequency and area are reported by the synthesis tool. Power can be estimated with all synthesis tools (ISE has XPower); the estimate can be very precise if you supply the input data correctly.

Finally, throughput is a measure of how much data you can process. In both architectures, an output is available once every 3 cycles, so your throughput is 1/3 result per cycle. Compare that to this description:

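process(clk)
begin
    if rising_edge(clk) then
        -- Stage 1: first addition; f and the valid flag are delayed alongside it
        s_0 <= a + b;
        v_0 <= in_valid;
        f_0 <= f;

        -- Stage 2: second addition
        s_1 <= s_0 + f_0;
        v_1 <= v_0;

        -- Stage 3: output register
        res <= s_1;
        out_ready <= v_1;
    end if;
end process;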

In my description, latency is also 3 cycles, but I can get an output every cycle, so my throughput is 1 result per cycle, 3 times that of your descriptions. In theory the area should be larger than that of your circuits, but in practice it probably is not, since there is less overhead (no FSM). A more complicated example would show a bigger difference; addition is pretty simple, especially in FPGAs.

In my experience, latency is rarely an issue, except on interfaces and in selected applications. We want maximum throughput and minimum area, and sometimes low power (though FPGAs are not very good at that). Maximum operating frequency is related to throughput: a circuit that produces 1 result per clock cycle and can run at 200 MHz has twice the overall throughput of one that runs at 100 MHz.
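
As a rough worked example of how the two factors combine: the 200 MHz and 100 MHz figures are the ones above, while running the 3-cycle FSM at 200 MHz is only an assumption for illustration, since its real maximum frequency has to come from the synthesis report.

```latex
% T: overall throughput, f_clk: clock frequency, r: results per clock cycle
\[
T = f_{\mathrm{clk}} \cdot r
\]
\[
\begin{aligned}
\text{pipelined, } 200\ \mathrm{MHz},\ r = 1 &: \quad T = 200 \times 10^{6}\ \text{results/s} \\
\text{pipelined, } 100\ \mathrm{MHz},\ r = 1 &: \quad T = 100 \times 10^{6}\ \text{results/s} \\
\text{3-state FSM, } 200\ \mathrm{MHz},\ r = \tfrac{1}{3} &: \quad T \approx 66.7 \times 10^{6}\ \text{results/s}
\end{aligned}
\]
```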

Finally, you seem confused about the use of CSA. Its advantage is its constant delay: it takes as long to do a CSA on 1024 bits as it does on 4 bits. By constraining your CSA to run once per clock cycle, you negate that advantage, because the time saved is lost; the final maximum operating frequency will be dictated by the full adder anyway.
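
To see where the constant delay comes from, here is a minimal sketch of a combinational CSA written with numeric_std. The entity name csa_3to2, the port names and the WIDTH generic are made up for illustration and do not come from the question.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity and port names, for illustration only.
entity csa_3to2 is
    generic (WIDTH : positive := 8);
    port (
        a, b, c_in : in  unsigned(WIDTH - 1 downto 0);
        s, c_out   : out unsigned(WIDTH downto 0)
    );
end entity csa_3to2;

architecture rtl of csa_3to2 is
begin
    -- Every bit position is computed independently of its neighbours:
    -- no carry ripples along the word, so the combinational delay
    -- does not grow with WIDTH.
    s     <= '0' & (a xor b xor c_in);

    -- Majority function of the three inputs, shifted left by one bit.
    -- The outer parentheses are required because "&" binds tighter
    -- than the logical operators.
    c_out <= ((a and b) or (a and c_in) or (b and c_in)) & '0';
end architecture rtl;
```

The true sum a + b + c_in equals s + c_out and can need WIDTH + 2 bits. That last step is a normal carry-propagate addition, and its delay does grow with the width.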

You should also be aware that Xilinx LUT6-based devices (Spartan-6 and newer) can perform a 3-input addition for the same cost (timing and area) as a 2-input addition. The LUT6 performs the CSA, and the fast carry logic does the final addition. The 2-input addition also uses the LUT6 (as a route-through) and the fast carry logic. Thus, there is no disadvantage in using 3-input adders on Xilinx.
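
A minimal sketch of what that could look like for A + B + F, using a single registered 3-input addition. The entity name add3_reg, the WIDTH generic and the output width are assumptions for illustration. Whether the tool really maps this to one adder level depends on the synthesis tool and the target device, so check the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity name and generic, for illustration only.
entity add3_reg is
    generic (WIDTH : positive := 8);
    port (
        clk     : in  std_logic;
        a, b, f : in  unsigned(WIDTH - 1 downto 0);
        -- Two extra bits so that the sum of three WIDTH-bit values cannot overflow.
        res     : out unsigned(WIDTH + 1 downto 0)
    );
end entity add3_reg;

architecture rtl of add3_reg is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- One registered 3-input addition: latency of 1 cycle,
            -- and a new result can be produced on every clock cycle.
            res <= resize(a, WIDTH + 2)
                 + resize(b, WIDTH + 2)
                 + resize(f, WIDTH + 2);
        end if;
    end process;
end architecture rtl;
```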
