saman

Reputation: 311

OpenCL Quartus hardware generation is very time-consuming

I recently purchased a Nallatech PCI-Express FPGA board, for which I am developing OpenCL applications. My main problem is the long time it takes to compile OpenCL into hardware, which I believe comes from the Quartus hardware generation stage. For example, even a really simple OpenCL kernel takes at least 7 to 8 hours to build and be ready to deploy.

I don't have a hardware background, and in particular no experience with Quartus. All I can see is that Quartus uses only a single core of my server. Is there any way to configure the hardware generation stage so that it uses all the resources of my machine and speeds up the build?

Here is a sample kernel I am compiling:

#pragma OPENCL EXTENSION cl_khr_fp64: enable
__kernel void MAdd16(__global double *data, int nIters) {
  int gid = get_global_id(0), globalSize = get_global_size(0);
  double s = data[gid];
  double16 s0 = s + (double16)(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1,1.1,1.2,1.3,1.4,1.5);
  for (int j=0 ; j<nIters ; ++j) {
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
    s0=10.0f-s0*0.9899f;
  }
  data[gid] = s0.s0+s0.s1+s0.s2+s0.s3+s0.s4+s0.s5+s0.s6+s0.s7+s0.s8+s0.s9+s0.sa+s0.sb+s0.sc+s0.sd+s0.se+s0.sf;
}

Upvotes: 0

Views: 79

Answers (1)

Gaslight Deceive Subvert

Reputation: 20400

I believe your code compiles slowly because it is difficult to optimize. The loop body contains 20 double16 multiplications, equivalent to 20 × 16 = 320 scalar double multiplications. On my FPGA, a double multiplication (including the subtraction) requires four DSPs, for a total of 320 × 4 = 1280 DSPs. Unless you have a very high-end FPGA, this is far more DSPs than your board has.
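The arithmetic behind that estimate can be written as a small host-side C helper. This is only a back-of-the-envelope sketch: `estimate_dsps` is a hypothetical name, and the figure of four DSPs per double multiply-subtract is the number quoted above for one particular FPGA, not a property of every device.

```c
/* Rough DSP-block estimate for a fully unrolled loop body.
 * Assumption: every scalar multiply gets its own hardware multiplier,
 * and each one costs `dsps_per_multiply` DSP blocks (device-specific). */
int estimate_dsps(int statements, int vector_lanes, int dsps_per_multiply) {
    /* One vector statement expands to one scalar multiply per lane. */
    int scalar_multiplies = statements * vector_lanes;
    return scalar_multiplies * dsps_per_multiply;
}
```

With the figures above, `estimate_dsps(20, 16, 4)` gives 1280. If only a single double16 statement had to be instantiated, the same formula would give `estimate_dsps(1, 16, 4)`, that is 64.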

You can reduce the hardware requirements and compilation time by rolling the 20 repeated statements into an inner loop:

for (int j=0 ; j<nIters ; ++j) {
    for (int k = 0; k < 20; k++) {
        s0=10.0f-s0*0.9899f;
    }
}

Hardware compilers don't apply such transformations automatically because they affect timing.
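To convince yourself that rolling the statements does not change the computed value, you can emulate one lane of the kernel in plain C on the host. This is a sketch under assumptions, not the OpenCL kernel itself: `step`, `unrolled` and `rolled` are hypothetical names, and a single `double` stands in for one component of the `double16`.

```c
/* One lane of the kernel statement s0 = 10.0f - s0 * 0.9899f. */
double step(double s) {
    return 10.0f - s * 0.9899f;
}

/* Original form: 20 copies of the statement per outer iteration. */
double unrolled(double s, int n_iters) {
    for (int j = 0; j < n_iters; ++j) {
        s = step(s); s = step(s); s = step(s); s = step(s); s = step(s);
        s = step(s); s = step(s); s = step(s); s = step(s); s = step(s);
        s = step(s); s = step(s); s = step(s); s = step(s); s = step(s);
        s = step(s); s = step(s); s = step(s); s = step(s); s = step(s);
    }
    return s;
}

/* Rolled form: the same 20 updates expressed as an inner loop. */
double rolled(double s, int n_iters) {
    for (int j = 0; j < n_iters; ++j) {
        for (int k = 0; k < 20; k++) {
            s = step(s);
        }
    }
    return s;
}
```

Both functions apply the same sequence of 20 × n_iters updates, so they should agree for any starting value. The difference is only in how much hardware has to be generated for the loop body, assuming the compiler does not unroll the inner loop again.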

Upvotes: 0
