Reputation: 305
I have CUDA/C++ code that returns host-side C++ arrays. I wanted to manipulate these arrays in MATLAB, so I rewrote my code as a MEX function and compiled it with mex.
I got it to work by passing preallocated arrays from MATLAB into the MEX function, but this slowed things down dramatically: 54 seconds, versus 14 seconds without MEX.
Here's the slow solution for a simplified version of my code (no inputs, one output):
#include "mex.h"
#include "gpu/mxGPUArray.h"
#include "matrix.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "curand.h"
#include <cuda_runtime.h>
#include "math.h"
#include <curand_kernel.h>
#include <time.h>
#include <algorithm>
#include <iostream>
#define iterations 159744
#define transMatrixSize 2592 // Just for clarity. Do not change. No need to adjust this value for this simulation.
#define reps 1024 // Is equal to blocksize. Do not change without proper source code adjustments.
#define integralStep 13125 // Number of time steps to be averaged at the tail of the Force-Time curves to get Steady State Force
__global__ void kern(float *masterForces, ...)
{
int globalIdx = ((blockIdx.x + (blockIdx.y * gridDim.x)) * (blockDim.x * blockDim.y)) + (threadIdx.x + (threadIdx.y * blockDim.x));
...
...
{
...
{
masterForces[i] = buffer[0]/24576.0;
}
}
}
...
}
}
void mexFunction(int nlhs, mxArray *plhs[],
int nrhs, mxArray const *prhs[])
{
...
plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
//Device input vectors
float *d_F0;
...
// Allocate memory for each vector on GPU
cudaMalloc((void**)&d_F0, iterations * sizeof(float));
...
//////////////////////////////////////////////LAUNCH ////////////////////////////////////////////////////////////////////////////////////
kern<<<1, 1024>>>( d_F0);
//////////////////////////////////////////////RETRIEVE DATA ////////////////////////////////////////////////////////////////////////////////////
cudaMemcpyAsync( h_F0 , d_F0 , iterations * sizeof(float), cudaMemcpyDeviceToHost);
///////////////////Free Memory///////////////////
cudaDeviceReset();
////////////////////////////////////////////////////
}
Why so slow?
EDIT: mex was compiling for an older architecture (sm_13) instead of sm_35. With the correct architecture the timing makes sense: 16 s with MEX vs. 14 s with CUDA/C++ only.
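For reference, a sketch of how the target architecture can be forced at compile time. The exact mechanism varies by MATLAB release (this is an assumption, not taken from the question): newer releases accept nvcc flags via mexcuda, while older ones hard-code the -arch flag in the CUDA options file (e.g. mex_CUDA_glnxa64.xml or nvopts.sh), which must be edited by hand.

```shell
# Sketch (option names vary by MATLAB release): pass the real GPU's
# architecture to nvcc so the kernel is not built for a legacy target.
mexcuda -v mymexkernel.cu NVCCFLAGS='-gencode arch=compute_35,code=sm_35'
```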
Upvotes: 3
Views: 528
Reputation: 30579
There is no need to use an mxGPUArray if the outputs of your CUDA code are plain-old-data (POD) host-side (as opposed to device-side) arrays, like your Forces1 array of floats created with new. The MathWorks example you are referencing is probably demonstrating the use of MATLAB's gpuArray and built-in CUDA functionality, rather than how to pass data to and from regular CUDA functions within a MEX function.
If you can initialize Forces1 (or h_F0 in your full code) outside of and before the CUDA function (e.g. in the mexFunction), then the solution is simply to change from new to one of the mxCreate* functions (mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateNumericMatrix, etc.), and then pass the data pointer to your CUDA function:
plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
// myCudaWrapper(..., h_F0, ...)  e.g. cudaMemcpyAsync(h_F0, d_F0, ...)
The only changes to your code are thus:
Replace:
float *h_F0 = new float[(iterations)];
with
plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
Remove:
delete h_F0;
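Put together, the relevant portion of the mexFunction would then look something like this sketch (the kernel launch and the elided parts are as in the question; cudaFree is used here instead of cudaDeviceReset as a minor cleanup choice):

```cuda
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray const *prhs[])
{
    // MATLAB allocates and owns the output buffer
    plhs[0] = mxCreateNumericMatrix(iterations, 1, mxSINGLE_CLASS, mxREAL);
    float *h_F0 = (float*) mxGetData(plhs[0]);

    float *d_F0;
    cudaMalloc((void**)&d_F0, iterations * sizeof(float));

    kern<<<1, 1024>>>(d_F0);

    // synchronous copy back into the MATLAB-owned buffer
    cudaMemcpy(h_F0, d_F0, iterations * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_F0);
    // no delete/free of h_F0 -- the MATLAB memory manager handles it
}
```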
Note: If instead your CUDA code owned the output host-side array, then you would have to copy the data into an mxArray. This is because unless you allocate the mexFunction outputs with the mx API, any data buffer you assign (e.g. with mxSetData) will not be handled by the MATLAB memory manager, and you will get a segfault or, at best, a memory leak.
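That copy could look like the following sketch, where h_out is a hypothetical name for a host buffer that the CUDA code allocated itself with new:

```cuda
// Sketch: the output buffer was allocated by the CUDA code (not by MATLAB),
// so copy its contents into a MATLAB-managed mxArray, then release the original.
plhs[0] = mxCreateNumericMatrix(iterations, 1, mxSINGLE_CLASS, mxREAL);
memcpy(mxGetData(plhs[0]), h_out, iterations * sizeof(float));
delete[] h_out;  // safe now: MATLAB owns its own copy of the data
```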
Upvotes: 3