Reputation: 193
I am having some trouble with an FFT-based signal cross-correlation module I'm creating (it makes use of the circular convolution theorem). I'd like to confirm that the following scheme ensures that each recursion level of the FFT butterfly computation finishes, and that the buffers containing the data are fully written, before the next level starts. The circular correlation/convolution involves a forward FFT, an element-wise (bin-by-bin) product of the spectra, and then an IFFT.
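For reference, the identity I'm relying on is the circular cross-correlation theorem (conventions differ on which spectrum gets conjugated):

corr(x, y) = IFFT( conj(FFT(x)) * FFT(y) )

where the product is taken element-wise in the frequency domain.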
Because of this scheme, there is no kernel in the correlation pipeline that reorders the data into bit-reversed index order. The forward FFT kernel produces a bit-reversed-order FFT and, after the element-wise products, the IFFT consumes that result directly and computes a natural-order solution.
I should mention that I am using multiple GPUs.
Anyway, here is a pseudo-code representation of what happens for every FFT/IFFT (the access/operation patterns are identical apart from conjugated twiddle factors; the normalization kernel comes later):
for numDevices:
    data -> buffers
    buffers -> kernel arguments

for fftRecursionLevels:
    for numDevices:
        recursionLevel -> kernel arguments
        deviceCommandQueue -> enqueueNDRangeKernel
        deviceCommandQueue -> flush()
    for numDevices:
        deviceCommandQueue -> finish()
(EDIT: The method is Radix-2 DIT, if that is not clear, sorry.)
Can I get away with that? As far as I understand, finish() is a blocking call, so that very last for loop will not return until every kernel for that level has finished computing across its global range (here fftSize / 2; see any literature on radix-2 butterfly operations). As a bonus, some of the kernels are already executing, thanks to flush(), while I'm still enqueueing the remaining ones.
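To make sure we're talking about the same thing, here is a stripped-down sketch of what the host loop effectively does, written against the plain OpenCL C API. The names (queues, fftKernels, globalSize) are placeholders rather than my actual code, error checking is omitted, and it assumes each device's buffers and constant arguments (including totalDepth) were set beforehand:

    // Hypothetical per-level dispatch: one in-order command queue per device.
    #include <CL/cl.h>
    #include <vector>

    void runForwardFFT(const std::vector<cl_command_queue>& queues,
                       const std::vector<cl_kernel>& fftKernels,
                       cl_uint totalDepth,
                       size_t globalSize /* this device's share of the fftSize / 2 work-items */)
    {
        for (cl_uint level = 1; level <= totalDepth; ++level)
        {
            // Enqueue this butterfly stage on every device, then flush so the
            // driver starts submitting work while the loop keeps enqueueing.
            for (size_t d = 0; d < queues.size(); ++d)
            {
                clSetKernelArg(fftKernels[d], 2, sizeof(cl_uint), &level);
                clEnqueueNDRangeKernel(queues[d], fftKernels[d], 1, nullptr,
                                       &globalSize, nullptr, 0, nullptr, nullptr);
                clFlush(queues[d]);
            }
            // Block until every device has finished this stage before the
            // next stage's kernels touch the same buffers.
            for (size_t d = 0; d < queues.size(); ++d)
                clFinish(queues[d]);
        }
    }

In words: every device gets one in-order queue, each butterfly level is enqueued and flushed on all devices, and the clFinish() loop blocks until the whole level is done before the next level's kernels read the buffers.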
Overall, I'm getting some funky/garbage results from the OpenCL/C++ version of this software. I have implemented the full data pipeline in Python (the algorithm is "topologically equivalent", if you will; obviously there are no host<-->device buffer transfers or device-side operations in the Python version) and simulated how the kernel should run, and it produces IDENTICAL results to just running the scipy.fftpack modules on the signal data vectors.
I guess some pictures would help. Here is exactly what is happening in both programs.
1) generate a Gaussian vector
2) zero-pad the Gaussian vector to the next-highest power-of-2 length (2N here)
3) forward FFT, producing a natural-order (in ω) result
4) plot
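For completeness, a host-side sketch of steps 1 and 2 (the Gaussian width sigma below is a placeholder; the exact width doesn't matter for the shape comparison):

    // Hypothetical test-vector setup: Gaussian over [0, N), centered at N/2,
    // zero-padded to the next power of two >= 2N.
    #include <cmath>
    #include <vector>

    std::vector<float> makePaddedGaussian(std::size_t N)
    {
        std::size_t padded = 1;
        while (padded < 2 * N) padded <<= 1;           // next power of two >= 2N
        std::vector<float> v(padded, 0.0f);
        const float mu = N / 2.0f, sigma = N / 8.0f;   // sigma is an assumed placeholder
        for (std::size_t n = 0; n < N; ++n)
        {
            float t = (static_cast<float>(n) - mu) / sigma;
            v[n] = std::exp(-0.5f * t * t);
        }
        return v;
    }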
Here is the Python simulation of my kernel, compared to just using scipy.fftpack.fft(vector):
https://i.sstatic.net/gXbbV.png
They are the same. Now compare that to either of:
https://i.sstatic.net/gE6R0.png
(Ignore the indices on the x-axes; they are all natural-order FFT results.)
They all start from the same kind of data (a Gaussian from 0 to N, centered at N/2, and zero-padded to 2N in this case), and they should all look like the green/blue line in the first image, but they do not. My eyes have glazed over from how long I've been staring at the host/device code of that second program, and I do not see any typos or incorrect algorithms. I highly suspect something is happening device-side that I am not aware of, hence my posting here. The algorithm LOOKS like it is operating correctly: the general shape of the red curve roughly matches the blue/green one no matter the starting data (I've run the algorithm on different starting sets and it consistently resembles the blue/green result, but with that nonsensical noise/error), yet something is amiss.
So I turn to the interwebs. Thanks in advance.
EDIT: one of the posters below suggested that it is difficult to comment without seeing at least the device-side code, since there are questions about memory fencing, so I have posted the kernel code below.
//fftCorr.cl
//
//OpenCL Kernels/Methods for FFT Cross Correlation of Signals
//
//Copyright (C) 2013 Steve Novakov
//
//This file is part of OCLSIGPACK.
//
//OCLSIGPACK is free software: you can redistribute it and/or modify
//it under the terms of the GNU General Public License as published by
//the Free Software Foundation, either version 3 of the License, or
//(at your option) any later version.
//
//OCLSIGPACK is distributed in the hope that it will be useful,
//but WITHOUT ANY WARRANTY; without even the implied warranty of
//MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
//GNU General Public License for more details.
//
//You should have received a copy of the GNU General Public License
//along with OCLSIGPACK. If not, see <http://www.gnu.org/licenses/>.
//
#define PIE 3.14159265359f
void DFT2(float2 *a, float2 *b, float2 *w)
{
    // complex multiply: bmul = (*w) * (*b)
    float2 bmul = (float2)( (*w).x*(*b).x - (*w).y*(*b).y,
                            (*w).x*(*b).y + (*w).y*(*b).x );
    // butterfly: a' = a + w*b, b' = a - w*b
    float2 tmp = (*a) - bmul;
    (*a) += bmul;
    (*b)  = tmp;
}
//
//
// Spin Factor Calc
//
// Computes spin/twiddle factor for particular bit reversed index.
//
//
float2 spinFact(unsigned int N, unsigned int k)
{
    float phi = -2.0 * PIE * (float) k / (float) N;
    // \bar{w}^offset_(groupDist)
    float spinRe, spinIm;
    spinIm = sincos( phi, &spinRe );
    return (float2) (spinRe, spinIm);
}

float2 spinFactR(unsigned int N, unsigned int k)
{
    float phi = 2.0 * PIE * (float) k / (float) N;
    // w^offset_(groupDist)
    float spinRe, spinIm;
    spinIm = sincos( phi, &spinRe );
    return (float2) (spinRe, spinIm);
}
//
// Bit-Reversed Index Reversal, (that sounds confusing)
//
unsigned int BRIR( unsigned int index, unsigned int fftDepth )
{
    unsigned int rev = index;
    rev = (((rev & 0xaaaaaaaa) >> 1 ) | ((rev & 0x55555555) << 1 ));
    rev = (((rev & 0xcccccccc) >> 2 ) | ((rev & 0x33333333) << 2 ));
    rev = (((rev & 0xf0f0f0f0) >> 4 ) | ((rev & 0x0f0f0f0f) << 4 ));
    rev = (((rev & 0xff00ff00) >> 8 ) | ((rev & 0x00ff00ff) << 8 ));
    rev = (((rev & 0xffff0000) >> 16) | ((rev & 0x0000ffff) << 16));
    rev >>= (32 - fftDepth);
    return rev;
}
//
//
// Index Bit Reversal Kernel, if Necessary/for Testing.
//
// Maybe I should figure out an in-place swap algorithm later.
//
//
__kernel void bitRevKernel( __global float2 * fftSetX,
                            __global float2 * fftSetY,
                            __global float2 * fftRevX,
                            __global float2 * fftRevY,
                            unsigned int fftDepth )
{
    unsigned int glID  = get_global_id(0);
    unsigned int revID = BRIR(glID, fftDepth);

    fftRevX[revID] = fftSetX[glID];
    fftRevY[revID] = fftSetY[glID];
}
//
//
// FFT Radix-2 Butterfly Operation Kernel
//
// This is an IN-PLACE algorithm. It calculates both bit-reversed indices and spin factors in the same thread and
// updates the original set of data with the "butterfly" results.
//
// recursionLevel is the level of recursion of the butterfly operation
// # of threads is half the vector size N/2, (glID is between 0 and this value, non inclusive)
//
// Assumes natural order data input. Produces bit-reversed order FFT output.
//
//
__kernel void fftForwKernel( __global float2 * fftSetX,
                             __global float2 * fftSetY,
                             unsigned int recursionLevel,
                             unsigned int totalDepth )
{
    unsigned int glID = get_global_id(0);

    unsigned int gapSize   = 1 << (recursionLevel - 1);
    unsigned int groupSize = 1 << recursionLevel;

    unsigned int base   = (glID >> (recursionLevel - 1)) * groupSize;
    unsigned int offset = glID & (gapSize - 1);

    unsigned int bitRevIdA = base + offset;
    unsigned int bitRevIdB = bitRevIdA + gapSize;

    unsigned int actualIdA = BRIR(bitRevIdA, totalDepth);
    unsigned int actualIdB = BRIR(bitRevIdB, totalDepth);

    float2 tempXA = fftSetX[actualIdA];
    float2 tempXB = fftSetX[actualIdB];
    float2 tempYA = fftSetY[actualIdA];
    float2 tempYB = fftSetY[actualIdB];

    float2 spinF = spinFact(groupSize, offset);

    // size-2 DFT (butterfly)
    DFT2(&tempXA, &tempXB, &spinF);
    DFT2(&tempYA, &tempYB, &spinF);

    fftSetX[actualIdA] = tempXA;
    fftSetX[actualIdB] = tempXB;
    fftSetY[actualIdA] = tempYA;
    fftSetY[actualIdB] = tempYB;
}
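For anyone who wants to reproduce the indexing without a device, here is a single-threaded reference sketch of one butterfly stage that mirrors fftForwKernel's index and twiddle arithmetic. It is not my actual host code: it uses std::complex instead of float2, operates on one contiguous array, and ignores the multi-GPU split.

    #include <cmath>
    #include <complex>
    #include <vector>

    // Mirrors BRIR: reverses the low `depth` bits of idx.
    static unsigned bitReverse(unsigned idx, unsigned depth)
    {
        unsigned rev = 0;
        for (unsigned i = 0; i < depth; ++i)
            rev |= ((idx >> i) & 1u) << (depth - 1 - i);
        return rev;
    }

    // data.size() == 1 << totalDepth; natural-order input, bit-reversed-order
    // output after running recursionLevel = 1 .. totalDepth.
    void butterflyStage(std::vector<std::complex<float>>& data,
                        unsigned recursionLevel, unsigned totalDepth)
    {
        const unsigned gapSize   = 1u << (recursionLevel - 1);
        const unsigned groupSize = 1u << recursionLevel;
        const unsigned half      = static_cast<unsigned>(data.size() / 2);
        for (unsigned glID = 0; glID < half; ++glID)   // one "work-item" each
        {
            const unsigned base   = (glID >> (recursionLevel - 1)) * groupSize;
            const unsigned offset = glID & (gapSize - 1);
            const unsigned idA    = bitReverse(base + offset,           totalDepth);
            const unsigned idB    = bitReverse(base + offset + gapSize, totalDepth);
            const float phi = -2.0f * 3.14159265359f * offset / groupSize;
            const std::complex<float> w(std::cos(phi), std::sin(phi));
            const std::complex<float> a = data[idA], b = w * data[idB];
            data[idA] = a + b;   // same butterfly as DFT2
            data[idB] = a - b;
        }
    }

Running levels 1 through totalDepth on a natural-order input and then bit-reversing the indices (what bitRevKernel does) should reproduce the Python/scipy result; comparing this array against a clEnqueueReadBuffer snapshot after each level should pinpoint the first level where the device output diverges.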
For the data shown in the pictures, I run "fftForwKernel" as described at the beginning of the post, and then I run "bitRevKernel".
Upvotes: 1
Views: 1033
Reputation: 5390
So, without code to inform anything, and running under the assumption that the two implementations really are 'the same', I'm inclined to say it's probably a synchronization problem. If it isn't a global synchronization problem (properly splitting the work between kernel invocations), then it's a problem with missing global memory fences, or even missing local memory fences per work-group within a kernel invocation.
How are you organizing the calls? The pseudo-code provided seems ambiguous about how exactly you're splitting up the recursive FFTs. I'm guessing you're doing the 'right thing' for a... well, I'm not even sure whether you're doing DIT or DIF or any of the huge variety of other data flows available to FFT algorithms. For all I know, you could be doing butterflies without properly memory-fencing them, or you may be synchronizing your FFT steps so strictly that the butterflies are part of a completely different kernel invocation from the recursive FFT steps, in which case my suggestions are totally null and void.
(I would have watered this down into a comment, but I lack the reputation to do so, so my apologies that this was posted as an 'answer')
Upvotes: 1