vImage has same performance as normal loop with dispatch_apply

This code runs 9600 times inside nested loops on each pass, and the whole pass has to finish in under 30 ms on an iPhone 4S:

vImage_Buffer source = { sourceArea.data, patchSide, patchSide, patchSide };
vImage_Buffer destination = { (uchar*)malloc(patchSide * patchSide * sizeof(uchar)), patchSide, patchSide, patchSide };
vImage_AffineTransform transform = { warpingMatrix(0,0), warpingMatrix(0,1), warpingMatrix(1,0), warpingMatrix(1,1), 0, 0 };

if (vImageAffineWarp_Planar8(&source, &destination, NULL, &transform, 0, kvImageBackgroundColorFill) != kvImageNoError)
{
    NSLog(@"Error in warping!");
}

It doesn't seem as fast as I expected: it takes about 0.0002 seconds for a 10x10 patch. Am I overlooking some big performance mistake?

The problem I want to solve is matching backward-warped images, and this is the first step. The patch is warped 64 times, at 64 different positions around a point, for each of at most 150 points.
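
For comparison, a "normal loop" version of a backward warp could look roughly like the sketch below. It is not the code from this question and not what vImage does internally: the function name warp_patch_nearest, the nearest-neighbour sampling, and the convention that the six coefficients map destination coordinates to source coordinates are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Backward-warp a square 8-bit patch with nearest-neighbour sampling.
   ASSUMPTION of this sketch (not vImage's convention): m maps
   DESTINATION coordinates to SOURCE coordinates:
     sx = m[0]*x + m[1]*y + m[2]
     sy = m[3]*x + m[4]*y + m[5]
   Destination pixels whose source falls outside the patch get `background`. */
void warp_patch_nearest(const uint8_t *src, uint8_t *dst, size_t side,
                        const float m[6], uint8_t background)
{
    assert(src != dst);  /* in-place warping is not supported here */
    const float maxCoord = (float)side - 0.5f;
    for (size_t y = 0; y < side; y++) {
        for (size_t x = 0; x < side; x++) {
            float sx = m[0] * (float)x + m[1] * (float)y + m[2];
            float sy = m[3] * (float)x + m[4] * (float)y + m[5];
            /* Range check in float first, so the casts below are safe. */
            int inside = sx >= -0.5f && sx < maxCoord &&
                         sy >= -0.5f && sy < maxCoord;
            if (inside) {
                size_t ix = (size_t)(sx + 0.5f);  /* round to nearest */
                size_t iy = (size_t)(sy + 0.5f);
                dst[y * side + x] = src[iy * side + ix];
            } else {
                dst[y * side + x] = background;
            }
        }
    }
}
```

A loop like this has no per-call setup cost, which is one reason it can be hard to beat on patches as small as 10x10.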

Upvotes: 2

Answers (3)

notedible

Reputation: 993

There are probably two changes you can make to improve performance without much change to your code: letting the vImage framework allocate your source/destination buffers, and creating a temp buffer that is reused within the loop. A third change might be to change your tile size (see the end of this answer).

Documentation for vImage/Accelerate Framework recommends using vImageBuffer_Init (see vImage_Utilities.h) to initialize your buffers, to ensure that the actual buffer is "sized and aligned for best performance", rather than allocating memory yourself:

vImage_Buffer buffer;
vImage_Error err = vImageBuffer_Init(&buffer, height, width, 8 * sizeof(pixel), kvImageNoFlags);

where pixel in your case will be Pixel_8 since you are using the *_Planar8 functions. Note that you will still need to free buffer.data when you are done with it.

So you should initialize source and destination outside of your loop:

vImage_Buffer source;
vImage_Buffer destination;
vImage_Error err = vImageBuffer_Init(&source, patchSide, patchSide, 8 * sizeof(Pixel_8), kvImageNoFlags);
err = vImageBuffer_Init(&destination, patchSide, patchSide, 8 * sizeof(Pixel_8), kvImageNoFlags);

and copy your data from sourceArea.data to source.data. Keep in mind that source.rowBytes is not likely to equal source.width.
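
Because of that padding, a single memcpy of the whole patch is not safe in general; the copy has to go row by row. A minimal sketch in plain C follows. The helper name copy_planar8_rows is made up for this example and is not part of vImage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a width x height block of 8-bit pixels between two buffers whose
   rows may be padded. Strides are in bytes, like vImage's rowBytes.
   (Hypothetical helper for illustration, not a vImage function.) */
void copy_planar8_rows(uint8_t *dst, size_t dstRowBytes,
                       const uint8_t *src, size_t srcRowBytes,
                       size_t width, size_t height)
{
    assert(dstRowBytes >= width && srcRowBytes >= width);
    for (size_t y = 0; y < height; y++) {
        /* Copy only the visible pixels; padding bytes are left untouched. */
        memcpy(dst + y * dstRowBytes, src + y * srcRowBytes, width);
    }
}
```

With the names used in this answer, the call would be something like copy_planar8_rows(source.data, source.rowBytes, sourceArea.data, patchSide, patchSide, patchSide), assuming the rows of sourceArea.data really are patchSide bytes apart, as the initializer in the question implies.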

You should also create a temp buffer so that vImageAffineWarp_Planar8 can reuse it on every iteration. Because you pass NULL as the third argument, a temp buffer is currently allocated on every call. To determine the size of the temp buffer, call the function exactly as you will during operation but add the flag kvImageGetTempBufferSize, since different parameters/flags can require different buffer sizes (see @constant kvImageGetTempBufferSize in vImage_Types.h):

size_t tempBufferSize = vImageAffineWarp_Planar8(&source, &destination, NULL, &transform, 0, kvImageBackgroundColorFill | kvImageGetTempBufferSize);

You would then allocate the temp buffer:

void *tempBuffer = malloc(tempBufferSize);

And finally, in your loop you would pass tempBuffer every time:

if (vImageAffineWarp_Planar8(&source, &destination, tempBuffer, &transform, 0, kvImageBackgroundColorFill) != kvImageNoError)
{
    NSLog(@"Error in warping!");
}

So to recap: source and destination are allocated with vImageBuffer_Init and tempBuffer with malloc, all before the loop. The required size of tempBuffer is determined by calling vImageAffineWarp_Planar8 exactly as you will in your loop, but with the additional flag kvImageGetTempBufferSize. Hopefully that will speed you up a little!

One final thing you might look at, if your algorithm supports it, is working on larger tiles or image stripes (see the section Tiling / Strip Mining and Multithreading in vImage.h).

Upvotes: 0

Sten

Reputation: 3864

vImage is faster if it can reuse the buffers. So, if possible, declare and allocate the buffers (or the associated data) outside the loop.

unsigned char *sourceData = (unsigned char*)malloc(patchSide * patchSide * sizeof(uchar));
vImage_Buffer source = {sourceData, patchSide, patchSide, patchSide};

unsigned char *destinationData = (unsigned char*)malloc(patchSide * patchSide * sizeof(uchar));
vImage_Buffer destination = {destinationData, patchSide, patchSide, patchSide};

for (int i = 0; i < warpCount; i++) { // warpCount: placeholder for your own loop bounds
   //fill sourceData e.g. through memcpy
   memcpy(sourceData, somedata, patchSide * patchSide * sizeof(uchar));

   if (vImageAffineWarp_Planar8(&source, &destination, NULL, &transform, 0, kvImageBackgroundColorFill) != kvImageNoError)
   {
     NSLog(@"Error in warping!");
   }
   //destinationData contains the result
}
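
To check how much the per-iteration allocation costs on a given device, a small timing harness along these lines can help. It is only a sketch in portable C: process_patch is a stand-in that inverts pixels rather than warping them, and all function names here are invented for the example. The numbers it produces depend on the allocator and the hardware, so treat them as a rough indication only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

enum { PATCH_SIDE = 10 };  /* 10x10 patch, as in the question */

/* Stand-in for the real per-patch work (NOT a warp): invert each pixel. */
void process_patch(const uint8_t *src, uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) dst[i] = (uint8_t)(255 - src[i]);
}

/* Variant 1: allocate and free the destination on every iteration.
   Returns elapsed CPU time in seconds. */
double time_alloc_each(const uint8_t *src, long iterations, unsigned long *checksum)
{
    const size_t n = PATCH_SIDE * PATCH_SIDE;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        uint8_t *dst = malloc(n);
        assert(dst != NULL);
        process_patch(src, dst, n);
        *checksum += dst[(size_t)i % n];  /* keep the result observable */
        free(dst);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Variant 2: allocate the destination once and reuse it.
   Returns elapsed CPU time in seconds. */
double time_reuse(const uint8_t *src, long iterations, unsigned long *checksum)
{
    const size_t n = PATCH_SIDE * PATCH_SIDE;
    uint8_t *dst = malloc(n);
    assert(dst != NULL);
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        process_patch(src, dst, n);
        *checksum += dst[(size_t)i % n];
    }
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    free(dst);
    return elapsed;
}
```

Calling both with the same source patch and 9600 iterations, and comparing the two returned times, shows whether allocation is a noticeable share of the total. The two checksums should match, which confirms both variants did the same work.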

Upvotes: 1

Ian Ollmann

Reputation: 1592

10x10 is a very small image. You could easily be spending most of your time in overhead / malloc. An Instruments time trace should help determine where the time is going.
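
To put the overhead in perspective, the figures from the question can be turned into a per-call budget. The sketch below only does that arithmetic; it assumes the quoted 0.0002 seconds is the cost of one warp call, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Time available per call, in microseconds, given a total budget in ms. */
double per_call_budget_us(double totalBudgetMs, long calls)
{
    assert(calls > 0);
    return totalBudgetMs * 1000.0 / (double)calls;
}

/* Total time in ms if every call takes secondsPerCall. */
double total_time_ms(double secondsPerCall, long calls)
{
    return secondsPerCall * 1000.0 * (double)calls;
}

/* Print both figures for the numbers quoted in the question. */
void print_budget(void)
{
    const long calls = 64L * 150L;  /* 64 warps x at most 150 points */
    printf("budget per call: %.3f us\n", per_call_budget_us(30.0, calls));
    printf("total at 0.0002 s per call: %.0f ms\n", total_time_ms(0.0002, calls));
}
```

With 9600 calls in 30 ms, each call has roughly 3 microseconds, while 0.0002 seconds per call adds up to nearly 2 seconds. At that scale even a single malloc per call can matter.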

The vector ALU on the 4S is also half the width of the one on a 5 or 5s, so it doesn't provide as much of a win over scalar code.

Upvotes: 3
