user2475221

Reputation: 21

How do I set up my CUDA thread blocks and grids for eight dimensions?

I do three-dimensional CUDA work on a regular basis, but now I have a problem with eight dimensions.

How do I set up my thread blocks and grids for eight dimensions?

For example in 3d I use:

    grid.x=int(321);  
    grid.y=int(321);

    foo<<<grid,int(321)>>>

So each thread gets its own x,y,z address and tries to solve a problem using only that x,y,z, and if the result is positive, it gets reported. I know how to pass the memory pointers to report the results.

Now I have a new problem where I want to try values of 0 to 11 on eight different axes instead of 0 to 321 on three. I assume each thread should get its set of (a,b,c,d,e,f,g,h) parameters via the thread blocks and grid.

How do I map eight dimensions via grid and block sizes into my kernel?

Upvotes: 2

Views: 1445

Answers (2)

user2475221

Reputation: 21

Here's my answer to my own question: a short CUDA program that finds unitless equations on eight axes. The search size on the second line should be an odd integer greater than five. The program outputs the powers of c0 h0 A0 R0 W0 G0 N0 F0 in row form.

I might be wrong, but I think the structure is old-school short-circuit logic: the threads diverge to report the right answers, and otherwise they stay converged and run at full speed through the wrong ones!

Here is the unitless equation, for the powers of the constants, that I am solving. I think the program is gnarly because the equation is gnarly. Each parenthesized group has to go to zero to make the equation true. One could use squares, abs, or pure logic like my method.

(x4-x5)^2+(x3+x8)^2+(x4+x7+x8)^2+(x2+x4-x6)^2+(-2*(x4+x6)+x8-x1-x2)^2+(2*(x2+x4)+x1+x5+3*x6)^2 = 0
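To make the equation concrete, here is a minimal host-side sketch that evaluates its left-hand side for one candidate. The names `residual` and `checkSampleRows` are made up for illustration and are not part of the search program; the two test rows are taken from the sample output further down.

```cpp
#include <cassert>

// Hypothetical helper (not part of the search program): evaluates the
// left-hand side of the equation for one candidate, x[0]..x[7] = x1..x8.
// The result is zero exactly when all six parenthesized groups are zero.
inline long long residual(const int x[8])
{
    const long long x1 = x[0], x2 = x[1], x3 = x[2], x4 = x[3];
    const long long x5 = x[4], x6 = x[5], x7 = x[6], x8 = x[7];

    const long long g1 = x4 - x5;
    const long long g2 = x3 + x8;
    const long long g3 = x4 + x7 + x8;
    const long long g4 = x2 + x4 - x6;
    const long long g5 = -2 * (x4 + x6) + x8 - x1 - x2;
    const long long g6 = 2 * (x2 + x4) + x1 + x5 + 3 * x6;

    return g1 * g1 + g2 * g2 + g3 * g3 + g4 * g4 + g5 * g5 + g6 * g6;
}

// Sanity check against two rows of the sample output (aborts if either fails).
inline void checkSampleRows()
{
    const int rowA[8] = {-14, -14, 0, 14, 14, 0, -14, 0};
    const int rowB[8] = {-7, -13, -2, 12, 12, -1, -14, 2};
    assert(residual(rowA) == 0);
    assert(residual(rowB) == 0);
}
```

Calling `checkSampleRows()` from a small `main` is enough to confirm that a reported row really solves the equation, which is the "easy to check" step mentioned in this answer.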

The code slows down for right answers, but they are rare, so it runs at near full speed most of the time. Once found, a right answer is easy to check. All rights reserved, M. Snyder, June 2013. I have tried a range of 0 to 59 on eight axes and it ran in a few hours on a GTX 560. I am now trying a range of 0 to 95 on my GTX 480.

If anyone can help me to get it to run faster, I would be interested...

    #include <stdio.h>
    #define searchsize 27
    // nvcc helloworld.cu -o helloworld -arch=sm_21 -maxrregcount=20 -ccbin=gcc-4.4

    __global__ void helloworld()
    {
        int x1, x2, x3, x4, x5, x6, x7, x8, rlow, rhgh;

        // The search range is symmetric around zero: [rlow, rhgh].
        rlow = -((gridDim.x - 1) / 2);
        rhgh = ((gridDim.x - 1) / 2);

        // x1..x3 come from the block and thread indices;
        // x4..x8 are iterated inside each thread.
        x1 = blockIdx.x + rlow;
        x2 = blockIdx.y + rlow;
        x3 = threadIdx.x + rlow;
        x4 = rlow;
        x5 = rlow;
        x6 = rlow;
        x7 = rlow;
        x8 = rlow;

        while (x8 <= rhgh)
        {
            // Each test is one parenthesized group of the equation set to zero.
            if (x4 == x5)
            {
                if (x3 == -x8)
                {
                    if (x4 + x7 == -x8)
                    {
                        if (x2 + x4 == x6)
                        {
                            if (-2 * (x4 + x6) + x8 == x1 + x2)
                            {
                                if (2 * (x2 + x4) + x1 + x5 == -3 * x6)
                                {
                                    printf("%+4d,%+4d,%+4d,%+4d,%+4d,%+4d,%+4d,%+4d \n", x1, x2, x3, x4, x5, x6, x7, x8);
                                }
                            }
                        }
                    }
                }
            }

            // Advance x4..x8 like an odometer, carrying into the next axis.
            x4 = x4 + 1;
            if (x4 > rhgh)
            {
                x5 = x5 + 1;
                x4 = rlow;
            }
            if (x5 > rhgh)
            {
                x6 = x6 + 1;
                x5 = rlow;
            }
            if (x6 > rhgh)
            {
                x7 = x7 + 1;
                x6 = rlow;
            }
            if (x7 > rhgh)
            {
                x8 = x8 + 1;
                x7 = rlow;
            }
        }
    }

    int main()
    {
        int rangeofsearch(searchsize);
        dim3 grid, block;
        grid.x = rangeofsearch;
        grid.y = rangeofsearch;
        block.x = rangeofsearch;
        size_t buf = 1e7;

        // Enlarge the device-side printf buffer so results are not dropped.
        cudaDeviceSetLimit(cudaLimitPrintfFifoSize, buf);
        helloworld<<<grid, block>>>();
        cudaDeviceSynchronize();
        return 0;
    }

Sample Output, powers in row form.

c0, h0, A0, R0, W0, G0, N0, F0

-14, -14, +0, +14, +14, +0, -14, +0
-13, -13, +0, +13, +13, +0, -13, +0
-12, -12, +0, +12, +12, +0, -12, +0
-11, -11, +0, +11, +11, +0, -11, +0
-7, -13, -2, +12, +12, -1, -14, +2
-6, -12, -2, +11, +11, -1, -13, +2
-5, -11, -2, +10, +10, -1, -12, +2
+0, -12, -4, +10, +10, -2, -14, +4
+1, -11, -4, +9, +9, -2, -13, +4
+7, -11, -6, +8, +8, -3, -14, +6
-14, -8, +2, +9, +9, +1, -7, -2
-13, -7, +2, +8, +8, +1, -6, -2
-12, -6, +2, +7, +7, +1, -5, -2
-11, -5, +2, +6, +6, +1, -4, -2 
...

Upvotes: 0

CygnusX1

Reputation: 21779

The 2D nature of grids and the 3D nature of blocks are just a convenience from NVIDIA; they might as well accept just single integers there, and the hardware would work in the same way. That is why, if your problem is not inherently 2D or 3D, I would suggest using single-dimension indexing and "splitting" the index where needed. Something like this:

    int grid = 65536;
    int block = 256;
    foo<<<grid,block>>>();

and then in your device code:

    __device__ int globalIndex() { return blockIdx.x * blockDim.x + threadIdx.x; }

    __device__ int index8D(int dim) { return (globalIndex() >> (dim*3)) % 8; }

In the above example, by calling index8D(i) you get the i-th coordinate corresponding to the current thread. However, the valid coordinates are only in range from 0 to 7. You will need a lot more threads to increase that range...

Be warned! An eight-dimensional cube with even a few cells in each dimension is huge! If your 8D space iterates only over [0..7], we are looking at 8^8 cells in total (16777216). You may want to consider having a single thread iterate over several cells.
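For a range that is not a power of two, such as the 0 to 11 from the question, the same splitting idea can be sketched with division and modulo instead of shifts. This is only an illustration under assumptions: `coordinate`, `cellCount` and the `HOST_DEVICE` macro are hypothetical names, and the linear index is passed in as a parameter so the arithmetic can be checked on the host.

```cpp
#include <cassert>
#include <cstdint>

// Lets the same helpers compile as plain C++ or as CUDA device code.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: treats `linear` as an 8-digit number in base `radix`
// and returns digit number `dim` (dim 0 is the least significant digit).
HOST_DEVICE inline int coordinate(std::uint64_t linear, int dim, int radix)
{
    assert(dim >= 0 && dim < 8 && radix > 1);
    for (int i = 0; i < dim; ++i)
        linear /= radix;
    return static_cast<int>(linear % radix);
}

// Number of cells in an 8-dimensional cube with `radix` values per axis.
HOST_DEVICE inline std::uint64_t cellCount(int radix)
{
    std::uint64_t n = 1;
    for (int i = 0; i < 8; ++i)
        n *= radix;
    return n;
}
```

With 12 values per axis, `cellCount(12)` is 429981696, so one thread per cell needs far more blocks than the 65536 used above. Older devices may not accept that many blocks in a single grid dimension, which is one more reason to let each thread step through several linear indices and call `coordinate` for each of them. Inside a kernel, the linear index could be formed as `blockIdx.x * (std::uint64_t)blockDim.x + threadIdx.x`.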


To explain the internals of index8D: I am essentially splitting the binary representation of global thread index into groups of 3 bits per dimension:

101 110 001 000 110 001 101 110

Each group now represents an index in one of the dimensions. The right shift and modulo are used to extract the corresponding 3-bit group. (Computing a modulo by a constant will be optimized by the compiler into a bitwise operation; I left it as such for readability.)
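As a worked check of the bit pattern shown above, here is a small host-side mirror of `index8D`. It is a sketch, not the device function itself: `index8DOf` and `kExample` are hypothetical names, and the global index is passed as a parameter so the result can be verified without a GPU.

```cpp
#include <cassert>

// Host-side mirror of index8D that takes the global index as a parameter
// (an assumption made here so the arithmetic can be checked without a GPU).
inline int index8DOf(unsigned int globalIndex, int dim)
{
    assert(dim >= 0 && dim < 8);
    return (globalIndex >> (dim * 3)) & 7u;   // same result as % 8
}

// The bit pattern 101 110 001 000 110 001 101 110, written in octal so that
// each octal digit corresponds to one 3-bit group.
const unsigned int kExample = 056106156u;
```

Dimension 0 is the rightmost group, so `index8DOf(kExample, 0)` yields 6 (binary 110) and `index8DOf(kExample, 7)` yields 5 (binary 101), the leftmost group.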

Upvotes: 1
