Reputation: 51

Operations with complex matrices using jCUDA

What is the best way to operate with complex numbers using jCuda? Should I use cuComplex format or is there any other solution (like an array with real and imaginary parts going one after another)? I would really appreciate examples of java code with this type of calculations.

As my purpose is to solve big systems of linear equations with complex numbers using GPU, I would not like to attach to jCuda only. What are the alternative ways to conduct such calculations with GPU?

Upvotes: 4

Answers (1)

Marco13

Reputation: 54679

First, concerning your question about computations with Java on the GPU in general, I wrote a few words about this here.

Your application case seems to be very specific. You should probably describe in more detail what your actual intention is, because this will govern all design decisions. Until now, I only can give some basic hints. Deciding which is the most appropriate solution is up to you.

One of the main difficulties when bridging between the Java world and the GPU world is the fundamentally different memory handling.

Memory layout in C/C++

The cuComplex struct in CUDA is defined as

typedef float2 cuFloatComplex
typedef cuFloatComplex cuComplex;

where float2 is basically something like a

struct float2 {
    float x; 
    float y; 
};

(with some additional specifiers for alignment etc.)

Now, When you allocate an array of cuComplex values in a C/C++ program, then you'll just write something like

cuComplex *c = new cuComplex[100];

In this case, it is guaranteed that the memory of all these cuComplex values will be a single, contiguous memory block. This memory block just consists of all the x and y values of the complex numbers, one after the other:

      _____________________________
c -> | x0 | y0 | x1 | y1 | x2 | y2 |... 
     |____|____|____|____|____|____|

This contiguous memory block can easily be copied to the device: One takes the pointer, and invokes a call like

cudaMemcpy(device, c, sizeof(cuComplex)*n, cudaMemcpyHostToDevice);

Memory layout in Java

Consider the case where you create a Java class that is structurally equal to the cuComplex struct, and allocate an array of these:

 class cuComplex {
     public float x;
     public float y;
 }

 cuComplex c[] = new cuComplex[100];

Then you do not have a single contiguous memory block of float values. Instead, you have an array of references to cuComplex objects, and the respective x and y values are scattered all around:

      ____________________
c -> |  c0  |  c1  |  c2  |... 
     |______|______|______|
        |       |      |
        v       v      v
      [x,y]   [x,y]  [x,y]

The key point here is:

You can't copy an (Java) array of cuComplex objects to the device!.

This has several implications. In the comments, you already referred to the cublasSetVector method that takes a cuComplex array as an argument, and where I tried to emphasize that this is not the most efficient solution, but only there for convenience. In fact, this method works by internally creating a new ByteBuffer in order to have a contiguous memory block, filling this ByteBuffer with the values from the cuComplex[] array, and then copying this ByteBuffer to the device.

Of course, this imposes an overhead that you'd most likely want to avoid in performance-critical applications.

There are several options for tackling this problem. Fortunately, for complex numbers, the solution is comparatively easy:

Don't use the cuComplex structure to represent arrays of complex numbers

Instead, you should represent your array of complex numbers as a single, contiguous memory block, where the real and imaginary parts of the complex numbers are interleaved, each being a single float or double value, respectively. This will allow the greatest interoperability between different backends (leaving out certain details, like alignment requirements).

Unfortunately, this may cause some inconveniences and raise some questions, and there is no one-fits-all solution for this.

If one tried to generalize this, to not only refer to complex numbers, but to "structures" in general, then there is a "pattern" that could be applied: One could create interfaces for the structures, and create a collection of these structures that is a list of instances of classes implementing this interface, which are all backed by a contiguous memory block. This may be appropriate for certain cases. But for complex numbers, the memory overhead of having a Java object for each complex number may be prohibitively large.

The other extreme, of only handling raw float[] or double[] arrays may also not be the best solution. For example: If you have an array of float values that represents complex numbers, then how could you multiply one of these complex numbers with another one?

One "intermediate" solution could to create an interface that allows accessing the real and imaginary parts of complex numbers. In the implementation, these complex numbers are stored in a single array, as described above.

I sketched such an implementation here.

NOTE:

This is only intended as an example, to show the basic idea, and to show how it may work together with something like JCublas. For you, a different strategy may be more appropriate, depending on your actual goals: Which other backends (besides JCuda) should there be? How "convenient" should it be to handle the complex numbers on Java side? What should the structures (classes/interfaces) for handling the complex numbers on the Java side look like?

For short: You should have a very clear idea about what your application/library should be able to do, before proceeding with the implementation.

import static jcuda.jcublas.JCublas2.*;
import static jcuda.jcublas.cublasOperation.CUBLAS_OP_N;
import static jcuda.runtime.JCuda.*;

import java.util.Random;

import jcuda.*;
import jcuda.jcublas.cublasHandle;
import jcuda.runtime.cudaMemcpyKind;

// An interface describing an array of complex numbers, residing
// on the host, with methods for accessing the real and imaginary
// parts of the complex numbers, as well as methods for copying
// the underlying data from and to the device
interface cuComplexHostArray
{
    int size();

    float getReal(int i);
    float getImag(int i);

    void setReal(int i, float real);
    void setImag(int i, float imag);

    void set(int i, cuComplex c);
    void set(int i, float real, float imag);

    cuComplex get(int i, cuComplex c);

    void copyToDevice(Pointer devicePointer);
    void copyFromDevice(Pointer devicePointer);
}

// A default implementation of a cuComplexHostArray, backed
// by a single float[] array
class DefaultCuComplexHostArray implements cuComplexHostArray
{
    private final int size;
    private final float data[];

    DefaultCuComplexHostArray(int size)
    {
        this.size = size;
        this.data = new float[size * 2];
    }

    @Override
    public int size()
    {
        return size;
    }

    @Override
    public float getReal(int i)
    {
        return data[i+i];
    }

    @Override
    public float getImag(int i)
    {
        return data[i+i+1];
    }

    @Override
    public void setReal(int i, float real)
    {
        data[i+i] = real;
    }

    @Override
    public void setImag(int i, float imag)
    {
        data[i+i+1] = imag;
    }

    @Override
    public void set(int i, cuComplex c)
    {
        data[i+i+0] = c.x;
        data[i+i+1] = c.y;
    }

    @Override
    public void set(int i, float real, float imag)
    {
        data[i+i+0] = real;
        data[i+i+1] = imag;
    }

    @Override
    public cuComplex get(int i, cuComplex c)
    {
        float real = getReal(i);
        float imag = getImag(i);
        if (c != null)
        {
            c.x = real;
            c.y = imag;
            return c;
        }
        return cuComplex.cuCmplx(real, imag);
    }

    @Override
    public void copyToDevice(Pointer devicePointer)
    {
        cudaMemcpy(devicePointer, Pointer.to(data),
            size * Sizeof.FLOAT * 2,
            cudaMemcpyKind.cudaMemcpyHostToDevice);
    }

    @Override
    public void copyFromDevice(Pointer devicePointer)
    {
        cudaMemcpy(Pointer.to(data), devicePointer,
            size * Sizeof.FLOAT * 2,
            cudaMemcpyKind.cudaMemcpyDeviceToHost);
    }
}

// An example that performs a "gemm" with complex numbers, once
// in Java and once in JCublas2, and verifies the result
public class JCublas2ComplexSample
{
    public static void main(String args[])
    {
        testCgemm(500);
    }

    public static void testCgemm(int n)
    {
        cuComplex alpha = cuComplex.cuCmplx(0.3f, 0.2f);
        cuComplex beta  = cuComplex.cuCmplx(0.1f, 0.7f);
        int nn = n * n;

        System.out.println("Creating input data...");
        Random random = new Random(0);
        cuComplex[] rhA = createRandomComplexRawArray(nn, random);
        cuComplex[] rhB = createRandomComplexRawArray(nn, random);
        cuComplex[] rhC = createRandomComplexRawArray(nn, random);

        random = new Random(0);
        cuComplexHostArray hA = createRandomComplexHostArray(nn, random);
        cuComplexHostArray hB = createRandomComplexHostArray(nn, random);
        cuComplexHostArray hC = createRandomComplexHostArray(nn, random);

        System.out.println("Performing Cgemm with Java...");
        cgemmJava(n, alpha, rhA, rhB, beta, rhC);

        System.out.println("Performing Cgemm with JCublas...");
        cgemmJCublas(n, alpha, hA, hB, beta, hC);

        boolean passed = isCorrectResult(hC, rhC);
        System.out.println("testCgemm "+(passed?"PASSED":"FAILED"));
    }

    private static void cgemmJCublas(
        int n,
        cuComplex alpha,
        cuComplexHostArray A,
        cuComplexHostArray B,
        cuComplex beta,
        cuComplexHostArray C)
    {
        int nn = n * n;

        // Create a CUBLAS handle
        cublasHandle handle = new cublasHandle();
        cublasCreate(handle);

        // Allocate memory on the device
        Pointer dA = new Pointer();
        Pointer dB = new Pointer();
        Pointer dC = new Pointer();
        cudaMalloc(dA, nn * Sizeof.FLOAT * 2);
        cudaMalloc(dB, nn * Sizeof.FLOAT * 2);
        cudaMalloc(dC, nn * Sizeof.FLOAT * 2);

        // Copy the memory from the host to the device
        A.copyToDevice(dA);
        B.copyToDevice(dB);
        C.copyToDevice(dC);

        // Execute cgemm
        Pointer pAlpha = Pointer.to(new float[]{alpha.x, alpha.y});
        Pointer pBeta = Pointer.to(new float[]{beta.x, beta.y});
        cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
            pAlpha, dA, n, dB, n, pBeta, dC, n);

        // Copy the result from the device to the host
        C.copyFromDevice(dC);

        // Clean up
        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dC);
        cublasDestroy(handle);
    }

    private static void cgemmJava(
        int n,
        cuComplex alpha,
        cuComplex A[],
        cuComplex B[],
        cuComplex beta,
        cuComplex C[])
    {
        for (int i = 0; i < n; ++i)
        {
            for (int j = 0; j < n; ++j)
            {
                cuComplex prod = cuComplex.cuCmplx(0, 0);
                for (int k = 0; k < n; ++k)
                {
                    cuComplex ab =
                        cuComplex.cuCmul(A[k * n + i], B[j * n + k]);
                    prod = cuComplex.cuCadd(prod, ab);
                }
                cuComplex ap = cuComplex.cuCmul(alpha, prod);
                cuComplex bc = cuComplex.cuCmul(beta, C[j * n + i]);
                C[j * n + i] = cuComplex.cuCadd(ap, bc);
            }
        }
    }

    private static cuComplex[] createRandomComplexRawArray(
        int n, Random random)
    {
        cuComplex c[] = new cuComplex[n];
        for (int i = 0; i < n; i++)
        {
            float real = random.nextFloat();
            float imag = random.nextFloat();
            c[i] = cuComplex.cuCmplx(real, imag);
        }
        return c;
    }

    private static cuComplexHostArray createRandomComplexHostArray(
        int n, Random random)
    {
        cuComplexHostArray c = new DefaultCuComplexHostArray(n);
        for (int i = 0; i < n; i++)
        {
            float real = random.nextFloat();
            float imag = random.nextFloat();
            c.setReal(i, real);
            c.setImag(i, imag);
        }
        return c;
    }

    private static boolean isCorrectResult(
        cuComplexHostArray result, cuComplex reference[])
    {
        float errorNormX = 0;
        float errorNormY = 0;
        float refNormX = 0;
        float refNormY = 0;
        for (int i = 0; i < result.size(); i++)
        {
            float diffX = reference[i].x - result.getReal(i);
            float diffY = reference[i].y - result.getImag(i);
            errorNormX += diffX * diffX;
            errorNormY += diffY * diffY;
            refNormX += reference[i].x * result.getReal(i);
            refNormY += reference[i].y * result.getImag(i);
        }
        errorNormX = (float) Math.sqrt(errorNormX);
        errorNormY = (float) Math.sqrt(errorNormY);
        refNormX = (float) Math.sqrt(refNormX);
        refNormY = (float) Math.sqrt(refNormY);
        if (Math.abs(refNormX) < 1e-6)
        {
            return false;
        }
        if (Math.abs(refNormY) < 1e-6)
        {
            return false;
        }
        return
            (errorNormX / refNormX < 1e-6f) &&
            (errorNormY / refNormY < 1e-6f);
    }
}

(By the way: I'll probably take parts of this answer and extend them to become samples and/or "How To..." pages for JCuda. The task of providing some information like this has already been on my "todo" list for quite a while).

Upvotes: 3

Operations with complex matrices using jCUDA

Answers (1)

Memory layout in C/C++

Memory layout in Java

NOTE:

Related Questions