Steveng

Reputation: 1521

FPGA timing question

I am new to FPGA programming and I have a question regarding the performance in terms of overall execution time.

I have read that latency is calculated in terms of cycle-time. Hence, overall execution time = latency * cycle time.

Since I want to optimize the time needed to process the data, I would be measuring the overall execution time.

Let's say I have a calculation a = b * c * d.

If I calculate it in two cycles (result1 = b * c, then a = result1 * d), the overall execution time would be a latency of 2 * cycle time. The cycle time is determined by the delay of one multiplication, say some value X, so the total is 2X.

If I do the calculation in one cycle (a = b * c * d), the overall execution time would be a latency of 1 * cycle time. The cycle time is now 2X, since the path contains two multiplications instead of one, so the total is again 2X.

So it seems that if I focus only on decreasing the latency, the cycle time increases, and vice versa. Is there a case where both latency and cycle time can be decreased, so that the execution time decreases? When should I focus on optimizing latency, and when on cycle time?

Also, when I program in C++, optimizing the code usually means reducing the number of cycles needed for execution (the latency). For FPGA programming, however, optimizing latency alone seems inadequate, since the cycle time may increase. So I should focus on optimizing the execution time (latency * cycle time). Am I correct, if I would like to increase the speed of the program?
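To make the trade-off concrete, here is a tiny back-of-the-envelope model in Python (the value of X is arbitrary, chosen only for illustration):

```python
# X = delay of one multiplication, in arbitrary time units (assumed value).
X = 5

# Two-cycle version: each cycle only has to fit one multiply.
latency_2cyc, cycle_time_2cyc = 2, X
exec_time_2cyc = latency_2cyc * cycle_time_2cyc   # 2 * X

# One-cycle version: the cycle must fit two chained multiplies.
latency_1cyc, cycle_time_1cyc = 1, 2 * X
exec_time_1cyc = latency_1cyc * cycle_time_1cyc   # 1 * 2X

print(exec_time_2cyc, exec_time_1cyc)  # 10 10 -- identical for a single item
```

For a single data item the two options really do come out equal, which is exactly the question: where does a real gain come from?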

I hope someone can help me with this. Thanks in advance.

Upvotes: 2

Views: 2243

Answers (3)

George

Reputation: 1132

I tend to think of latency as the time from the first input to the first output. As there is usually a series of data, it is useful to look at the time taken to process multiple inputs, one after another.

With your example, processing 10 items with a = b x c x d in one cycle (one cycle = 2t) would take 20t. However, doing it in two 1t cycles, pipelined, processing 10 items would take only 11t.

Hope that helps.

Edit: added timing diagrams.

Calculation in one 2t cycle. 10 calculations.

Time   0  2  2  2  2  2  2  2  2  2  2  = 20t

Input  1  2  3  4  5  6  7  8  9 10
Output    1  2  3  4  5  6  7  8  9 10

Calculation in two 1t cycles, pipelined, 10 calculations

Time   0  1  1  1  1  1  1  1  1  1  1  1  = 11t

Input  1  2  3  4  5  6  7  8  9 10
Stage1    1  2  3  4  5  6  7  8  9 10
Output       1  2  3  4  5  6  7  8  9 10

Latency for both solutions is 2t: one 2t cycle for the first one, and two 1t cycles for the second one. However, the throughput of the second solution is twice as fast. Once the latency is accounted for, you get a new answer every 1t cycle.

So if you had a complex calculation that required, say, five 1t cycles, the latency would be 5t, but the throughput would still be one result per 1t cycle.
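The tables above boil down to one formula: with an S-stage pipeline and a cycle time of t, N results take (S + N - 1) cycles. A quick sketch (the function name is made up for illustration):

```python
def total_time(n_items, n_stages, cycle_time):
    # Fill the pipeline (n_stages cycles for the first result),
    # then one new result per cycle for the remaining items.
    return (n_stages + n_items - 1) * cycle_time

print(total_time(10, 1, 2))  # 20 -- one 2t cycle, 10 items
print(total_time(10, 2, 1))  # 11 -- two 1t cycles, pipelined
print(total_time(10, 5, 1))  # 14 -- five 1t stages: 5t latency, still 1t throughput
```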

Upvotes: 2

Martin Thompson

Reputation: 16832

You need another word in addition to latency and cycle-time, which is throughput. Even if it takes 2 cycles to get an answer, if you can put new data in every cycle and get it out every cycle, your throughput can be 2x better than the "do it all in one cycle" approach.

Say your calculation takes 40 ns in one cycle, so a throughput of 25 million data items/sec.

If you pipeline it (which is the technical term for splitting up the calculation into multiple cycles) you can do it in 2 lots of 20 ns + a bit (you lose a bit in the extra registers that have to go in). Let's say that bit is 10 ns (which is a lot, but makes the sums easy). So now it takes 2x20 + 10 = 50 ns => 20M items/sec. Worse!

But, if you can make the 2 stages independent of each other (in your case, not sharing the multiplier) you can push new data into the pipeline every 25 + a bit ns. This "a bit" will be smaller than the previous one, but even if it's the whole 10 ns, you can push data in every 35 ns, or nearly 30M items/sec, which is better than you started with.

In real life the 10 ns will be much less, often 100s of ps, so the gains are much larger.
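Using the figures from this answer (40 ns in one cycle, and the pessimistic 35 ns per item once pipelined), the throughput numbers work out as follows; this is only the arithmetic, not a timing model:

```python
NS = 1e-9  # one nanosecond, in seconds

single_cycle = 40 * NS                 # whole calculation in one cycle
print(round(1 / single_cycle / 1e6, 1))  # 25.0 million items/sec

pipelined = 35 * NS                    # worst case: full 10 ns register overhead
print(round(1 / pipelined / 1e6, 1))   # 28.6 million items/sec -- "nearly 30M"
```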

Upvotes: 2

flolo

Reputation: 15526

George accurately described the meaning of latency (which does not necessarily relate to computation time). It seems you want to optimize your design for speed. This is very complex and requires much experience. The total runtime is

execution_time = (latency + (N * computation_cycles) ) * cycle_time

Where N is the number of calculations you want to perform. If you develop for acceleration you should only compute on large data sets, i.e. N is big. Usually you then don't have requirements for latency (which can be different in real-time applications). The determining factors are then the cycle_time and the computation_cycles, and here it is really hard to optimize, because the two are related. The cycle_time is determined by the critical path of your design, which gets longer the fewer registers you place on it; the longer the critical path, the bigger the cycle_time. But the more registers you have, the higher your computation_cycles (each register increases the number of required cycles by one).

Maybe I should add that the latency is usually the same as the number of computation_cycles (it's the first computation that determines the latency), but in theory this can differ.
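Plugging the question's two designs into this formula shows why cycle_time and computation_cycles dominate for large N (the numbers are illustrative, with the multiplier delay X taken as 1 time unit):

```python
def execution_time(latency, n, computation_cycles, cycle_time):
    # The runtime formula above, taken as stated.
    return (latency + n * computation_cycles) * cycle_time

X = 1  # assumed delay of one multiplier, arbitrary units

# Two-stage pipeline: latency 2, one new result per cycle, cycle_time X.
print(execution_time(2, 1_000_000, 1, X))      # 1000002
# Single long cycle: latency 1, one result per cycle, cycle_time 2X.
print(execution_time(1, 1_000_000, 1, 2 * X))  # 2000002
```

For big N the latency term vanishes into the noise, and the pipelined design wins by roughly the ratio of the cycle times.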

Upvotes: 1
