Reputation: 19937
My computer has both an Intel GPU and an NVIDIA GPU. The latter is much more powerful and is my preferred device when performing heavy tasks. I need a way to programmatically determine which one of the devices to use.
I'm aware of the fact that it is hard to know which device is best suited for a particular task. What I need is to (programmatically) make a qualified guess using the variables listed below.
How would you rank these two devices? Intel HD Graphics 4400
to the left, GeForce GT 750M
to the right.
GlobalMemoryCacheLineSize 64 vs 128
GlobalMemoryCacheSize 2097152 vs 32768
GlobalMemorySize 1837105152 vs 4294967296
HostUnifiedMemory true vs false
Image2DMaxHeight 16384 vs 32768
Image2DMaxWidth 16384 vs 32768
Image3DMaxDepth 2048 vs 4096
Image3DMaxHeight 2048 vs 4096
Image3DMaxWidth 2048 vs 4096
LocalMemorySize 65536 vs 49152
MaxClockFrequency 400 vs 1085
MaxComputeUnits 20 vs 2
MaxConstantArguments 8 vs 9
MaxMemoryAllocationSize 459276288 vs 1073741824
MaxParameterSize 1024 vs 4352
MaxReadImageArguments 128 vs 256
MaxSamplers 16 vs 32
MaxWorkGroupSize 512 vs 1024
MaxWorkItemSizes [512, 512, 512] vs [1024, 1024, 64]
MaxWriteImageArguments 8 vs 16
MemoryBaseAddressAlignment 1024 vs 4096
OpenCLCVersion 1.2 vs 1.1
ProfilingTimerResolution 80 vs 1000
VendorId 32902 vs 4318
Obviously, there are hundreds of other devices to consider. I need a general formula!
Upvotes: 4
Views: 4532
Reputation: 5999
Why guess? Choose dynamically on your hardware of the day: Take the code you wish to run on the "best" GPU and run it, on a small amount of sample data, on each available GPU. Whichever finishes first: use it for the rest of your calculations.
Upvotes: 1
Reputation: 6333
I'm loving all of the solutions so far. If it is important to make the best device selection automatically, that's how to do it (weight the values based on your usage needs and take the highest score).
Alternatively, and much simpler, is to just take the first GPU device, but also have a way for the user to see the list of compatible devices and change it (either right away or on the next run).
This alternative is reasonable because most systems only have one GPU.
Upvotes: 0
Reputation: 2565
As @Adriano as pointed out, there are many things to take into considerations...too many things. But I can think of few things (and easier things that could be done) to help you out (not to completely solve your problem) :
First thing first, which version of OCL do you need (not really related to performance). But if you use some feature of OCL 1.2...well problem solved
You can usually (and crudely) categorized your algorithms in one of these two categories: memory bounded or computation bounded. In the case it's memory bound (with a lot of transfers between host and device) probably the most interesting info would be the device with Host Unified Memory. If not, the most powerful processors most probably would be more interesting.
But most probably it wouldn't be as easy to choose in which category put your application. In that case you could make a small benchmark. Roughly, this benchmark would test different size of data (if your app has to deal with that) on dummy computations which would more or less match the amount of computations your application requires (estimated by you after you completed the development of your kernels). You could log the point where the amount of data is so big that it cancels the device most powerful but connected via PCIe.
Another very important thing when programming on GPUs is the GPU occupancy. The higher, the best. NVIDIA provides an Excel file that calculates the occupancy based on some input. Based on these concepts, you could more or less reproduce the calculation of the occupancy (some adjustment will most probably needed for other vendors) for both GPUs and choose the one with the highest.
Of course, you need to know the values of these inputs. Some of them are based on your code, so you can calculate them before hands. Some of them are linked to the specs of the GPU. You can query some of them as you already did, for some others you might need to hardcode the values in some files after some googling (but at least you don't need to have these GPUs at hands to test on them). Last but not least, don't forget that OCL provides the clGetKernelWorkGroupInfo()
which can provide you some info such as the amount of local or private memory needed by a specific kernel.
Regarding the info about the local memory please note that remark from the standard:
If the local memory size, for any pointer argument to the kernel declared with the __local address qualifier, is not specified, its size is assumed to be 0.
So, it means that this info could be useless if you have first to dynamically compute the size from the host side. A work-around for that could be to use the fact that the kernels are compiled in JIT. The idea here would be to use the preprocessor option -D when calling clBuildProgram()
as I explained here. This would give you something like:
#define SIZE
__mykernel(args){
local myLocalMem[SIZE];
....
}
After all the blabla. I'm guessing that you worry about this because you might want to ship your application to some users without knowing what hardware they have. Would it be very inconvenient (at install time or maybe after by providing them a command or a button) to simply run you application with dummy generated data to measure which device performed better and simply log it in a config file?
Sometime, depending on you specific problem (that could not involve to many syncs) you don't have to choose. Sometime, you could just simply split the work between the two devices and use both...
Upvotes: 2
Reputation: 67090
You can not have a simple formula to calculate an index from that parameters.
First of all let me assume you can trust collected data, of course if you read 2 for MaxComputeUnits
but in reality it's 80 then there is nothing you can do (unless you have your own database of cards with all their specifications).
How can you guess if you do not know task you have to perform? It may be something highly parallel (then more units may be better) or a raw brute calculation (then higher clock frequency or bigger cache may be better). As for normal CPU number of threads isn't the only factor you have to consider for parallel tasks. Just to mention few things you have to consider:
To make it short: you can not calculate an index in a reliable way because factors are too many and they're strongly correlated (for example high parallelism may be slowed by small cache or slow memory access but a specific instruction, if supported, may give you great performance even if all other parameters are poor).
If you need a raw comparison you may even simply do MaxComputeUnits * MaxClockFrequency
(and it may even be enough for many applications) but if you need a more accurate index then don't think it'll be an easy task and you'll get a general purpose formula like (a + b / 2)^2
, it's not and results will be very specific to task you have to accomplish.
Write a small test (as much similar as possible to what your task is, take a look to this post on SO) and run it with many cards, with a big enough statistic you may extrapolate an index from an unknown set of parameters. Algorithms can become pretty complex and there is a vast literature about this topic so I won't even try to repeat them here. I would start with Wikipedia article as summary to other more specific papers. If you need an example of what you have to do you may read Exploring the Multiple-GPU Design Space.
Remember that more variables you add to your study more results quality will be unstable, less parameters you use less results will be accurate. To better support extrapolation:
MaxGroupSize
may not be so relevant). This phase is really important and decisions should be made with statistic tools (you may for example calculate p-value).There are many other things to consider, a good book about data mining would help you more than 1000 words written here.
Upvotes: 1