program fails for array 30 x 30

Question

This is a program for matrix multiplication on the CUDA architecture. The code works fine when the array size is 30 x 30, but the output is a series of 0's when the size is greater. I am using a standard EC2 instance for CUDA hosted on a Linux machine. Can anybody figure out the reason?

#include <stdio.h>
#define SIZE 30

__global__ void matrix_multiply(float *input1,float  *input2,float *output,int dimension){


    int input1_index = threadIdx.x / dimension * dimension;
    int input2_index =  threadIdx.x % dimension;
    int i=0;
    for( i =0; i >>(c_input1,c_input2,c_result,SIZE);
    if(cudaGetLastError()!=cudaSuccess){
        printf("%s\n",cudaGetErrorString(cudaGetLastError()));
    }
    cudaMemcpy(result,c_result,sizeof(result),cudaMemcpyDeviceToHost);
    for(i=0;i

Paul R · Accepted Answer

You probably have a maximum of 1024 threads per block on your GPU. 30 x 30 = 900 threads, so that should be OK, but e.g. 40 x 40 = 1600 threads would exceed the limit and result in a kernel launch failure (take-home message: always check for errors!).
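As a rough sketch of that check, the per-block limit can be queried from the device instead of assumed, and the launch error can be read once and stored. The helper names `max_threads_per_block`, `fits_in_one_block` and `launch_ok` are made up for this illustration. Note that `cudaGetLastError()` resets the error state, so calling it a second time inside the `printf` (as the question's code does) would most likely print "no error".

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the per-block thread limit of a device,
// or -1 if the device properties cannot be read.
int max_threads_per_block(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.maxThreadsPerBlock;
}

// Hypothetical helper: true when a single block of size * size threads
// stays within the given per-block limit.
bool fits_in_one_block(int size, int limit) {
    return size * size <= limit;
}

// Hypothetical helper: call right after a kernel launch. The error is read
// once and stored, because cudaGetLastError() resets it to cudaSuccess.
bool launch_ok(const char *what) {
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

With a limit of 1024, `fits_in_one_block(30, 1024)` holds while `fits_in_one_block(40, 1024)` does not, which matches the behaviour described in the question.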

You probably want to consider organizing your data differently, e.g. SIZE blocks of SIZE threads and then call the kernel as:

matrix_multiply<<<SIZE, SIZE>>>(c_input1,c_input2,c_result,SIZE);

(Obviously you'll need to modify your array indexing within the kernel code, e.g. use the block index as the row and the thread index as the column.)
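One way the modified indexing might look is sketched below, assuming square, row-major matrices and a `dimension` no larger than the per-block thread limit. The helper `flat_index` and the host wrapper `multiply_on_device` are hypothetical names introduced for this example; they are not part of the original code.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: position of (row, col) in a row-major square matrix.
__host__ __device__ inline int flat_index(int row, int col, int dimension) {
    return row * dimension + col;
}

// One block per output row, one thread per output column.
__global__ void matrix_multiply(const float *input1, const float *input2,
                                float *output, int dimension) {
    int row = blockIdx.x;   // block index selects the row
    int col = threadIdx.x;  // thread index selects the column
    if (row >= dimension || col >= dimension) return;

    float sum = 0.0f;
    for (int i = 0; i < dimension; ++i) {
        sum += input1[flat_index(row, i, dimension)] *
               input2[flat_index(i, col, dimension)];
    }
    output[flat_index(row, col, dimension)] = sum;
}

// Hypothetical host wrapper: copies a and b to the device, launches n blocks
// of n threads, and copies the product back into c. Returns the first error.
cudaError_t multiply_on_device(const float *a, const float *b, float *c, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;

    cudaError_t err = cudaMalloc((void **)&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_c, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        matrix_multiply<<<n, n>>>(d_a, d_b, d_c, n);
        err = cudaGetLastError();  // catches an invalid launch configuration
    }
    if (err == cudaSuccess) err = cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Freeing a null pointer is a no-op, so this is safe after a partial failure.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return err;
}
```

Accumulating into a local `sum` and writing it once also means the result does not depend on the output buffer having been zeroed beforehand. This layout still caps `dimension` at the per-block thread limit; a 2D grid of fixed-size blocks would remove that cap, at the cost of slightly more index arithmetic.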
