tamasfe

Reputation: 157

PyCUDA 2D array implementations (or working with strings)

I'm trying to work with an array of strings (words) in CUDA.

I tried flattening it by concatenating everything into a single string, but then to index it, I'd have to scan through part of the string each time a kernel runs to find where a word starts. If there are 9000 words with a length of 6 characters, locating the last word means walking past the 8999 words before it, i.e. examining 53994 characters in the worst case, for each kernel call. So I'm looking for different ways to do it.

Update: I forgot to mention that the strings are of different lengths, so I'd also have to find the end of each one.
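For reference, the scan can be avoided entirely by computing each word's start offset once on the host. A minimal sketch of that idea (the names here are illustrative, not from the code below):

import numpy as np

wordList = ['asd', 'bsd', 'csd']
flat = ''.join(wordList)                 # one flattened string
lengths = [len(w) for w in wordList]
# offsets[i] is where word i starts in flat; offsets[i+1] - offsets[i] is its length
offsets = np.cumsum([0] + lengths).astype(np.uint64)   # [0, 3, 6, 9]

With such a table, a kernel can jump straight to word i through offsets[i] instead of scanning for separators; the answer below demonstrates this offset approach on the GPU.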

The next thing I tried was copying each word to a different memory location, collecting the addresses, and passing those to the GPU as an array, with the following code:

# np = numpy

wordList = ['asd','bsd','csd']

d_words = []

for word in wordList:
    d_words.append(gpuarray.to_gpu(np.array(word, dtype=str)))

d_wordList = gpuarray.to_gpu(np.array([word.ptr for word in d_words], dtype=np.int32))

ker_test(d_wordList, block=(1,1,1), grid=(1,1,1))

and in the kernel:

__global__ void test(char** d_wordList) {
    printf("First character of the first word is: %c \n", d_wordList[0][0]);
}

The kernel should receive an int32 array of pointers, each pointing to the beginning of a word (effectively a char**), but it doesn't work as I expect.

What is wrong with this approach?

Also what are the "standard" ways to work with strings in PyCUDA (or even in CUDA) in general?

Thanks in advance.

Upvotes: 0

Views: 1243

Answers (1)

Robert Crovella

Reputation: 151849

After some further thought, I've concluded that for this case of variable-length strings, using an "offset array" may not be much different from 2D indexing (i.e. double-pointer indexing) when considering data access within the kernel: both involve a level of indirection.

Here's a worked example demonstrating both methods:

$ cat t5.py
#!/usr/bin/env python
import numpy as np
from pycuda import compiler, gpuarray

import pycuda.autoinit

# method 1: double-pointer (2D) indexing; each block handles one word,
# each thread one character
kernel_code1 = """
__global__ void test1(char** d_wordList) {
    (d_wordList[blockIdx.x][threadIdx.x])++;
}
    """

# method 2: flattened storage plus an offset array; offsets[i] marks
# where word i begins in the single character buffer
kernel_code2 = """
__global__ void test2(char* d_wordList, size_t *offsets) {
    (d_wordList[offsets[blockIdx.x] + threadIdx.x])++;
}
    """

mod = compiler.SourceModule(kernel_code1)
ker_test1 = mod.get_function("test1")

wordList = ['asd','bsd','csd']

d_words = []

# one device allocation and one host->device copy per word
for word in wordList:
    d_words.append(gpuarray.to_gpu(np.array(word, dtype=str)))

# gather the device pointers; np.uintp matches the size of a pointer
d_wordList = gpuarray.to_gpu(np.array([word.ptr for word in d_words], dtype=np.uintp))

# 3 blocks (one per word), 3 threads (one per character);
# each kernel invocation increments a character in place
ker_test1(d_wordList, block=(3,1,1), grid=(3,1,1))

# one device->host copy per word
for word in d_words:
    result = word.get()
    print(result)

mod2 = compiler.SourceModule(kernel_code2)
ker_test2 = mod2.get_function("test2")
# a single flattened allocation and copy, plus the offsets
d_words2 = gpuarray.to_gpu(np.array(['asdbsdcsd'], dtype=str))
offsets = gpuarray.to_gpu(np.array([0,3,6,9], dtype=np.uint64))
ker_test2(d_words2, offsets, block=(3,1,1), grid=(3,1,1))
h_words2 = d_words2.get()
print(h_words2)

$ python t5.py
bte
cte
dte
['btectedte']
$

Notes:

  1. For the double-pointer case, the only change from the OP's example was to use the numpy.uintp type for the pointer array, as suggested in the comments by @talonmies. Device pointers are 64-bit values on a 64-bit platform, so storing them as np.int32 truncates them; np.uintp is an unsigned integer exactly the size of a pointer.

  2. I don't think the double-pointer access of data will necessarily be quicker or slower than the indirection of the offset lookup; both cost one extra read. The other performance consideration is copying data between host and device: the double-pointer method involves one allocation and one copy operation per string, in each direction, whereas the offset method moves all the characters in a single transfer (plus one for the offsets). For a large number of strings, the difference will be noticeable in the host/device copy time.

  3. Another possible merit of the offset method is that the length of each string is easy to determine: just subtract two adjacent entries in the offset list. That makes it easy to decide how many threads can operate on a string in parallel, as opposed to having a single thread walk a string sequentially (or determining the length in kernel code, or passing each string's length separately); see the sketch just below.
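For illustration, a kernel sketch along those lines (hypothetical code, not part of the tested t5.py above): given N+1 offsets for N words, each block recovers its word's length from two adjacent entries and bounds-checks its threads, so words of different lengths can each be processed fully in parallel.

kernel_code3 = """
__global__ void test3(char* d_wordList, size_t *offsets) {
    // length of this block's word, from two adjacent offset entries
    size_t len = offsets[blockIdx.x + 1] - offsets[blockIdx.x];
    // threads past the end of the word do nothing
    if (threadIdx.x < len)
        (d_wordList[offsets[blockIdx.x] + threadIdx.x])++;
}
"""

This would be launched like test2 above, with one block per word and enough threads per block to cover the longest word.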

Upvotes: 1
