Reputation: 1223
I'm doing simulation testing for some VHDL I wrote and when I run it in ModelSim it gets stuck. When I hit 'break' it has an arrow pointing to the For
loop in the following function:
function MOD_3 (a, b, c : UNSIGNED (1023 downto 0)) return UNSIGNED is
VARIABLE x : UNSIGNED (1023 downto 0) := TO_UNSIGNED(1, 1024);
VARIABLE y : UNSIGNED (1023 downto 0) := a;
VARIABLE b_temp : UNSIGNED (1023 downto 0) := b;
begin
for I in 0 to 1024 loop
if b_temp > 0 then
if b_temp MOD 2 = 1 then
x := (x * y) MOD c;
end if;
y := (y * y) MOD c;
b_temp := b_temp / 2;
else
exit;
end if;
end loop;
return x MOD c;
end function;
I originally had this as a while
loop which I realize is not good for synthesizing. So I converted it to a for
loop with the condition that b_temp
is greater than 0. b_temp
is a 1024-bit unsigned
and so if it is the largest number that could be represented by 1024 bits and divided in half (which I do in each iteration) 1024 times, shouldn't it definitely be 0?
I have a feeling my problem lies in the large multiplications...if I comment out x := (x * y) MOD c
and y := (y * y) MOD c
then it exits the loop. So the only thing I can think of is it takes too long to carry out these 1024-bit multiplications? If this is the case, is there any built-in way I can optimize this to make it faster, or is my only option to implement something like Karatsuba multiplication, etc...?
Upvotes: 0
Views: 938
Reputation:
I'm of the opinion implementing a Montgomery multiplier in numeric_std function calls may not improve simulation as much as you'd like (while giving synthesis eligibility).
The issue is the number of dynamically elaborated subprogram calls vs. their operand sizes vs. fitting in your CPU-running-Modelsim's L1/L2/L3 caches.
It does do wonders for targeting synthesis in an FPGA or a SIMD GPU implementation.
See Subversion Repositories BasicRSA file modmult.vhd (which has a generic size). I successfully converted this to using numeric_std[_unsigned].
If I recall correctly this appears inspired by a Masters thesis (Efficient Hardware Architectures for Modular Multiplication) by David Narh Amanor in 2005 outlining a Java and a VHDL implementation in various word sizes.
I found the OpenCores implementation mentioned in a Stackoverflow question (Montgomery multiplication VHDL Implementation) and found the generic sized version in the SVN repository (the downloadable version is 16 bit) and the mention of the thesis in A 1024 – Bit Implementation of the Faster Montgomery Multiplier Using VHDL (by David Narh Anamor, the original link having expired). Note the quoted FPGA implementation performance under 42 usec.
Notice the length 1024 version specified by a generic would still be performing dynamically elaborated function calls with length 1024 operands (although not the "*"s, the "mod"s or the "/"s. You'd still be doing millions of function calls with dynamically elaborated (passed on an expression stack) 1024 bit parameters. We're simply changing how many millions of large parameter subroutine calls and how long they can take.
And that also brings up the possibility of an integer vector implementation (bignum equivalent) in VHDL, which would potential increase simulation performance even more (and you're likely in uncharted territory here).
A subprogram based version of the OpenCores model using variable parameters would be telling. (Whether or not you can impress anyone showing them a simulation model executing, or whether there's this looong pause interrupted by everyone taking furtive glances at the wall clock and looking bored).
Upvotes: 1