user1766888

Reputation: 409

VHDL matrix multiplication

Background: I'm trying to create a behavioral file for multiplying three matrices. I'm trying to debug it by first seeing if I can read the input matrix and then output the intermediate matrix.

Behavior File:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

entity DCT_beh is
    port (
            Clk :           in std_logic;
            Start :         in std_logic;
            Din :           in INTEGER;
            Done :          out std_logic;
            Dout :          out INTEGER
          );
 end DCT_beh;

architecture behavioral of DCT_beh is 
begin
    process
            type RF is array ( 0 to 7, 0 to 7 ) of INTEGER;

            variable i, j, k        : INTEGER;
            variable InBlock        : RF;
            variable COSBlock       : RF;
            variable TempBlock      : RF;
            variable OutBlock       : RF;
            variable A, B, P, Sum   : INTEGER; 

    begin

            COSBlock := ( 
    ( 125,  122,    115,    103,    88,     69,     47,     24  ),
    ( 125,  103,    47,     -24,    -88,    -122,   -115,   -69  ),
    ( 125,  69,     -47,    -122,   -88,    24,     115,    103  ),
    ( 125,  24,     -115,   -69,    88,     103,    -47,    -122  ),
    ( 125,  -24,    -115,   69,     88,     -103,   -47,    122  ),
    ( 125,  -69,    -47,    122,    -88,    -24,    115,    -103  ),
    ( 125,  -103,   47,     24,     -88,    122,    -115,   69  ),
    ( 125,  -122,   115,    -103,   88,     -69,    47,     -24  )
                    );

--Starting
    wait until Start = '1';
        Done <= '0';

--Read Input Data
    for i in 0 to 7 loop
        for j in 0 to 7 loop    
            wait until Clk = '1' and clk'event;
            InBlock(i,j) := Din;
        end loop;
    end loop;

--TempBlock = COSBLOCK * InBlock 

    for i in 0 to 7 loop
        for j in 0 to 7 loop
            Sum := 0;
            for k in 0 to 7 loop
                A := COSBlock( i, k ); 
                B := InBlock( k, j ); 
                P := A * B; 
                Sum := Sum + P; 
                if( k = 7 ) then 
                TempBlock( i, j ) := Sum;
                end if;
            end loop;
        end loop;
    end loop;


--Finishing 

    wait until Clk = '1' and Clk'event;
    Done <= '1';

--Output Data

    for i in 0 to 7 loop
        for j in 0 to 7 loop
            wait until Clk = '1' and Clk'event;
            Done <= '0';
            Dout <=  tempblock(i,j);
        end loop;
    end loop;
end process;      
 end behavioral;

Testbench File:

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
--USE ieee.numeric_std.ALL;

 ENTITY lab4b_tb IS
 END lab4b_tb;

ARCHITECTURE behavior OF lab4b_tb IS 

-- Component Declaration for the Unit Under Test (UUT)

COMPONENT DCT_beh
PORT(
     Clk : IN  std_logic;
     Start : IN  std_logic;
     Din : IN  INTEGER;
     Done : OUT  std_logic;
     Dout : OUT  INTEGER
    );
END COMPONENT;


   --Inputs
   signal Clk : std_logic := '0';
   signal Start : std_logic := '0';
   signal Din : INTEGER;

--Outputs
   signal Done : std_logic;
   signal Dout : INTEGER;

   -- Clock period definitions
   constant Clk_period : time := 10 ns;

 BEGIN

-- Instantiate the Unit Under Test (UUT)
   uut: DCT_beh PORT MAP (
      Clk => Clk,
      Start => Start,
      Din => Din,
      Done => Done,
      Dout => Dout
    );

   -- Clock process definitions
   Clk_process :process
   begin
    Clk <= '0';
    wait for Clk_period/2;
    Clk <= '1';
    wait for Clk_period/2;
  end process;


  -- Stimulus process
  stim_proc: process

variable i, j : INTEGER;
variable cnt : INTEGER;

  begin     
     -- hold reset state for 100 ns.

     wait for 100 ns;   

        start <= '1'; 
        wait for clk_period; 
        start <= '0';

    for cnt in 0 to 63 loop
        wait until clk = '1' and clk'event;
            din <= cnt;
        end loop;

        --wait for 100 ns;

        --start <= '1';
        --wait for clk_period;
        --start <= '0';

        --for i in 0 to 63 loop
          -- wait for clk_period;
            --if (i < 24) then
                --din <= 255;
            --elsif (i > 40) then
                --din <= 255;
            --else
                --din <= 0;
            --end if;
        --end loop;


  wait;
  end process;

END;

As I understand my code, when Start = 1 the matrix is read into InBlock. In this case the matrix is simply filled with the incrementing values 0 to 63. Then, once Done = 1, I output the multiplied matrix (TempBlock for now, while debugging). The problem is that in my simulation I receive some of the values that belong in the result, but they aren't in the correct order. For example, the line below contains the first row of the multiplied matrix, TempBlock:

 14464.000  15157.000  15850.000  16543.000  17236.000  17929.000  18622.000  19315.000
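The row above can be cross-checked by hand. Below is a minimal self-checking sketch, assuming the block is filled in row-major order so that InBlock(i,j) = 8*i + j; the entity name first_row_check is hypothetical and not part of the original design.

```vhdl
entity first_row_check is
end entity first_row_check;

architecture sim of first_row_check is
    type row_t is array (0 to 7) of integer;
    -- First row of COSBlock, copied from DCT_beh
    constant COS_ROW0 : row_t := (125, 122, 115, 103, 88, 69, 47, 24);
    -- First row of TempBlock as quoted in the question
    constant EXPECTED : row_t :=
        (14464, 15157, 15850, 16543, 17236, 17929, 18622, 19315);
begin
    process
        variable sum : integer;
    begin
        for j in 0 to 7 loop
            sum := 0;
            for k in 0 to 7 loop
                -- Assumption: InBlock(k, j) = 8*k + j (row-major fill with 0..63)
                sum := sum + COS_ROW0(k) * (8 * k + j);
            end loop;
            assert sum = EXPECTED(j)
                report "column " & integer'image(j) & " gives " & integer'image(sum)
                severity error;
        end loop;
        report "first-row check finished";
        wait;
    end process;
end architecture sim;
```

Under that assumption each sum works out to 14464 + 693*j, which matches the quoted row, so those expected values correspond to din = 0 landing in InBlock(0,0).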

As you can see in the picture of my simulation I get some of those values but then the signal becomes some weird large value.

I have some doubts about whether din(0), din(1), din(2) ... din(n) actually correspond to InBlock(0,0), InBlock(0,1), InBlock(0,2), etc. But I went over my behavioral file thoroughly and don't see any issues with it. Is there something wrong with how I've designed my testbench?

Testbench: bottom signals are unsigned values

EDIT: I need help getting the correct output for this stimulus:

        din <= 0;

        for i in 0 to 63 loop
            wait until clk = '1' and clk'event;
            if i = 0 then
                Start <= '1', '0' after clk_period;
            end if;
            if (i < 24) then
                din <= 255;
            elsif (i > 40) then
                din <= 255;
            else
                din <= 0;
            end if;
        end loop;

I thought it would be similar to the code in the answer, but I ran into exactly the same issue. How can this be fixed? Here is a picture of what is currently output. The correct values are there, but shifted by one clock period.
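One common way to get a one-clock shift of this kind is for the stimulus to assign din on the same rising edge on which the design samples it. The sketch below illustrates that general VHDL behaviour only; it is not a diagnosis of the code above, and the entity name sample_order_demo and the -1 marker value are hypothetical.

```vhdl
entity sample_order_demo is
end entity sample_order_demo;

architecture sim of sample_order_demo is
    signal clk : bit := '0';
    signal din : integer := -1;  -- -1 marks "nothing driven yet"
begin
    -- Finite clock: rising edges at 5, 15, 25 and 35 ns, then it stops
    clock : process
    begin
        for n in 1 to 8 loop
            wait for 5 ns;
            clk <= not clk;
        end loop;
        wait;
    end process;

    -- Driver: assigns din after each rising edge, like the stimulus loop
    driver : process
    begin
        for cnt in 0 to 3 loop
            wait until clk = '1' and clk'event;
            din <= cnt;  -- takes effect one delta cycle later
        end loop;
        wait;
    end process;

    -- Sampler: reads din on the same rising edges, like the read loop
    sampler : process
    begin
        for n in 0 to 3 loop
            wait until clk = '1' and clk'event;
            -- din still holds the value it had before this edge
            report "edge " & integer'image(n) & " sampled " & integer'image(din);
        end loop;
        wait;
    end process;
end architecture sim;
```

The sampler reports -1, 0, 1, 2: each value arrives one edge late, because a signal assignment only becomes visible in the following delta cycle. Separately, an INTEGER signal declared without an initial value starts at INTEGER'left, which could account for a very large magnitude appearing in a waveform before the first assignment.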

FINAL EDIT: Solved it myself. The problem was with the loop boundaries.

Upvotes: 3

Views: 11355

Answers (1)

user1155120

Here's what looks to be a working version of your model and its testbench.

Added (and updated)

If you were to make the matrix multiply take real time (clocks), you'd see Done delayed by the number of clocks it took to do the matrix multiply. I arbitrarily picked two clocks just to show the benefit of the added register files.

I'll comment on the interesting parts of the code.

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

 ENTITY lab4b_tb IS
 END lab4b_tb;

ARCHITECTURE behavior OF lab4b_tb IS 

   signal Clk:      std_logic   := '0';  -- no reset
   signal Start:    std_logic   := '0';  -- no reset
   signal Din:      INTEGER     := 0;     -- no reset

   signal Done : std_logic;
   signal Dout : INTEGER;

   constant Clk_period : time := 10 ns;

BEGIN

   uut: entity work.DCT_beh -- DCT_beh 
       PORT MAP (
           Clk => Clk,
           Start => Start,
           Din => Din,
           Done => Done,
           Dout => Dout
      );

CLOCK: 
    process
    begin
        Clk <= '0';
        wait for Clk_period/2;
        Clk <= '1';
        wait for Clk_period/2;
    end process;

STIMULUS: 
    process
        variable i, j : INTEGER;
        variable cnt : INTEGER;
    begin     

         wait until clk = '1' and clk'event;  -- sync Start to clk

FIRST_BLOCK_IN:
        Start <= '1','0' after 11 ns;  --issued same time as datum 0
        for i in 0 to 63 loop
                if (i < 24) then
                    din <= 255;
                elsif (i > 40) then
                    din <= 255;
                else
                    din <= 0;
                end if;
                wait until clk = '1' and clk'event;
        end loop;
SECOND_BLOCK_IN:
        Start <= '1','0' after 11 ns;  -- with first datum
        for cnt in 0 to 63 loop
            din <= cnt; 
            wait until clk = '1' and clk'event;
        end loop;
        din <= 0;  -- to show the last input datum clearly

        wait;
    end process;

END ARCHITECTURE;

The two input blocks are your new block values and your original block values, which provided an index for the first output block. The second block also shows the same answers as originally, validating the Done handshaking.

Note Start is concurrent with the first datum of each block.

I also adjusted the input stimulus to start on a clock boundary, so that the first Start does not appear on a falling clock edge.

Where there are asynchronously generated pulses, I extended them by a nanosecond to ensure they'd be seen on a clock edge, because they weren't generated on a clock edge.
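A minimal sketch of why the extra nanosecond helps, assuming the same 10 ns clock period as the testbench; the entity name pulse_width_demo and the signal name pulse are hypothetical.

```vhdl
entity pulse_width_demo is
end entity pulse_width_demo;

architecture sim of pulse_width_demo is
    constant CLK_PERIOD : time := 10 ns;
    signal clk   : bit := '0';
    signal pulse : bit := '0';
begin
    -- Finite clock: rising edges at 5, 15, 25 and 35 ns, then it stops
    clock : process
    begin
        for n in 1 to 8 loop
            wait for CLK_PERIOD / 2;
            clk <= not clk;
        end loop;
        wait;
    end process;

    stimulus : process
    begin
        wait until clk = '1' and clk'event;   -- rising edge at 5 ns
        pulse <= '1', '0' after 11 ns;        -- high from 5 ns until 16 ns
        wait until clk = '1' and clk'event;   -- rising edge at 15 ns
        assert pulse = '1'
            report "pulse missed the following clock edge" severity error;
        wait until clk = '1' and clk'event;   -- rising edge at 25 ns
        assert pulse = '0'
            report "pulse still high two edges later" severity error;
        wait;
    end process;
end architecture sim;
```

With an 11 ns pulse the next rising edge falls inside the pulse. A pulse of exactly one clock period would end at the same simulation time as that edge, so whether it is seen would depend on delta-cycle ordering.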

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;

entity DCT_beh is
    port (
        Clk :           in std_logic;
        Start :         in std_logic;
        Din :           in INTEGER;
        Done :          out std_logic;
        Dout :          out INTEGER
      );

 end DCT_beh;

architecture behavioral of DCT_beh is 
    type RF is array ( 0 to 7, 0 to 7 ) of INTEGER;
    signal OutBlock:            RF;
    signal InBlock:             RF;
    signal internal_Done:       std_logic := '0';  -- no reset
    signal Input_Ready:         std_logic := '0';  -- no reset
    signal done_detected:       std_logic := '0';  -- no reset
    signal input_rdy_detected:  std_logic := '0';  -- no reset
    signal last_out:            std_logic := '0';  -- no reset

begin
INPUT_DATA:
    process
    begin
        wait until Start = '1';
        --Read Input Data
        for i in 0 to 7 loop
            for j in 0 to 7 loop    
                wait until Clk = '1' and clk'event;
                InBlock(i,j) <= Din;
                if i=7 and j=7 then
                    Input_Ready <= '1', '0' after 11 ns;  
                end if;
            end loop;
        end loop;
    end process;

WAIT_FOR_InBlock:
    process
    begin   
        wait until clk = '1' and clk'event;
        input_rdy_detected <= Input_Ready;  
        --InBlock valid after the following rising edge of clk
    end process;

TRANSFORM:
    process 
            variable InpBlock       : RF;
            constant COSBlock       : RF :=
            ( 
                ( 125,   122,   115,    103,    88,     69,     47,      24  ),
                ( 125,   103,    47,    -24,   -88,   -122,   -115,     -69  ),
                ( 125,    69,   -47,   -122,   -88,     24,    115,     103  ),
                ( 125,    24,  -115,    -69,    88,    103,    -47,    -122  ),
                ( 125,   -24,  -115,     69,    88,   -103,    -47,     122  ),
                ( 125,   -69,   -47,    122,   -88,    -24,    115,    -103  ),
                ( 125,  -103,    47,     24,   -88,    122,   -115,      69  ),
                ( 125,  -122,   115,   -103,    88,    -69,     47,     -24  )
            );
            variable TempBlock      : RF;
            variable A, B, P, Sum   : INTEGER; 
    begin

        if input_rdy_detected = '0' then
            wait until input_rdy_detected = '1';
        end if;

        InpBlock := InBlock;  -- Broadside dump or swap

--TempBlock = COSBLOCK * InBlock  

-- arbitrarily make matrix multiply 2 clocks long
      wait until clk = '1' and clk'event;  -- 1st xfm clock

        for i in 0 to 7 loop
            for j in 0 to 7 loop
                Sum := 0;
                for k in 0 to 7 loop
                    A := COSBlock( i, k ); 
                    B := InpBlock( k, j ); 
                    P := A * B; 
                    Sum := Sum + P; 
                    if( k = 7 ) then 
                        TempBlock( i, j ) := Sum;
                    end if;
                end loop;
            end loop;
        end loop;

  --  Done issued in clk cycle of last TempBlock( i, j )  := Sum;

        internal_Done <= '1', '0' after 11 ns;  
        wait until clk = '1' and clk'event;  -- 2nd xfrm clk   
        -- OutBlock available after last TempBlock value stored   

        OutBlock <= TempBlock;   -- Broadside dump or swap
    end process;

Done_BUFFER:
    Done <= internal_Done;


WAIT_FOR_OutBlock:
    process
    begin
        wait until clk = '1' and clk'event;
        done_detected <= internal_Done;
        -- Done can come either before the first output_data transfer
        -- or during the last output data transfer
        -- this gives us the clock delay to finish the last xfm transfer to 
        -- TempBlock( i, j)
        -- Technically part of the output process but was too cumbersome to write
    end process;

OUTPUT_DATA:
    process
    begin
        -- OutBlock is valid after clock edge when Done is true
        for i in 0 to 7 loop
            for j in 0 to 7 loop

                if i = 0 and j = 0 then

                    if done_detected = '0' then
                        wait until done_detected = '1';
                    end if; 
                end if;  

                Dout <=  OutBlock(i,j);                        
                wait until clk = '1' and clk'event;
            end loop;
        end loop;
    end process;

end behavioral;

The type definition for RF has been moved to the architecture declarative part to allow inter-process communication through signals. The input loop, matrix multiply and output loop are in their own processes. I also added processes for the inter-process handshaking (Input_Ready and internal_Done, which drives Done), along with the signals input_rdy_detected and done_detected.

If a process can take 64 clocks, a signal marking the last datum processed (Input_Ready and potentially Done) is asserted during the last data transaction of the downstream process. It would be very messy to code otherwise, and you'd still need the flip-flops.

There's an added RF between the input process and the multiply process to allow concurrent operation when the matrix multiply takes real time. It takes 2 clocks in this example; I didn't want to stretch out the waveforms too far.

Some of the handshaking delays appear to have been related to coding style, and were cured with the input_rdy_detected and done_detected flip-flops.
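Each of those "detected" signals is a single flip-flop that registers a pulse on the clock. A minimal stand-alone sketch of the pattern, written here with a sensitivity list and rising_edge rather than a wait statement; the entity and port names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity pulse_reg is
    port (
        clk      : in  std_logic;
        pulse_in : in  std_logic;   -- e.g. Input_Ready or internal_Done
        detected : out std_logic    -- pulse_in as seen at the last rising edge
    );
end entity pulse_reg;

architecture rtl of pulse_reg is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Capture the handshake pulse; downstream logic waits on
            -- 'detected', which only changes just after a clock edge
            detected <= pulse_in;
        end if;
    end process;
end architecture rtl;
```

Because the output only changes after a clock edge, a process that waits on it starts aligned to the clock, regardless of when the original pulse was generated.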

The first waveform diagram shows the first output data following the two clocks the transform process now takes, shown between A and B markers.

Two Clock Matrix Multiply

You can see that the first output datum immediately following Done is 78540 and not the 110415 shown in your waveform screen capture. One of us shows the wrong value. This version of DCT_beh strictly enforces transfers of RF values only after the last datum is loaded.

I did get the 110415 value before cleaning up the handshaking between the input process and multiply process. It'd be a lot of work to trace it through TempBlock or OutBlock.
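The two candidate values can at least be reproduced by hand from column 0 of the input block. The sketch below is only an arithmetic check, assuming datum n of the 255/0/255 stimulus is meant to land in InBlock(n / 8, n mod 8); the entity name first_datum_check and the constant names are hypothetical.

```vhdl
entity first_datum_check is
end entity first_datum_check;

architecture sim of first_datum_check is
    type col_t is array (0 to 7) of integer;
    -- First row of COSBlock, copied from DCT_beh
    constant COS_ROW0 : col_t := (125, 122, 115, 103, 88, 69, 47, 24);

    -- Dot product of COS_ROW0 with one column of the input block
    function dot (col : col_t) return integer is
        variable sum : integer := 0;
    begin
        for k in 0 to 7 loop
            sum := sum + COS_ROW0(k) * col(k);
        end loop;
        return sum;
    end function dot;

    -- Column 0 under the row-major assumption:
    -- data 0, 8, 16 are 255; data 24, 32, 40 are 0; data 48, 56 are 255
    constant COL_ROW_MAJOR  : col_t := (255, 255, 255, 0, 0, 0, 255, 255);
    -- The same column with 0 instead of 255 in InBlock(0,0)
    constant COL_FIRST_ZERO : col_t := (0, 255, 255, 0, 0, 0, 255, 255);
begin
    process
    begin
        assert dot(COL_ROW_MAJOR) = 110415
            report "row-major column does not give 110415" severity error;
        assert dot(COL_FIRST_ZERO) = 78540
            report "first-zero column does not give 78540" severity error;
        report "first-datum check finished";
        wait;
    end process;
end architecture sim;
```

So 110415 is 255 * 433, the value of TempBlock(0,0) when the first datum is stored as 255, and 78540 is 255 * 308, which is what results when InBlock(0,0) holds 0 instead. This shows which column contents produce each number; it does not by itself identify where in either simulation the difference arises.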

Now for the good news. The second input block is taken from your original stimulus and the input values make a great index for the output transfers. Those output data values all appear correct.

2nd Block Done and start of 2nd block output

The signals input_rdy_detected and done_detected happen to show the first transaction in their respective downstream processes. I added a trailing din signal assignment to 0 to avoid confusion at the end of the second input block.

Here's a screen capture approximating yours. I can't do a selected zoom, so I used successive approximation instead.


You only need to run the simulation out to 1955 ns to capture the last datum of the 2nd block being out.

This was done using Tristan Gingold's ghdl and Tony Bybell's gtkwave on a Mac running OS X 10.8.4.

Upvotes: 2
