Reputation: 397
My dev env is as follows:
Device: Nexus 5
Android: 4.4.2
SDK Tools: 22.6.1
Platform Tools: 19.0.1
Build tools: 19.0.3
Build Target: level 19
Min Target: level 19
I'm working on an image processing application. Basically I need to run a preprocessing step on the image and then filter it with a 5x5 convolution. I successfully got the preprocessing script to run on the GPU with good performance. Since RenderScript offers a 5x5 convolution intrinsic, I'd like to use it to make the whole pipeline as fast as possible. However, I found that running the 5x5 convolution intrinsic after the preprocessing step is very slow. In contrast, if I use adb to force all the scripts to run on the CPU, the 5x5 convolution intrinsic is a lot faster. In both cases, the time consumed by the preprocessing step is basically the same, so it is the performance of the intrinsic that makes the difference.
Also, in the code I use
Allocation.USAGE_SHARED
when creating all the Allocations, hoping that the shared memory would speed up memory access between the CPU and GPU.
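In case it helps, here is roughly what the setup looks like. This is only a sketch: ScriptC_preprocess and the coefficients array stand in for my actual preprocessing kernel and filter values.

    import android.content.Context;
    import android.graphics.Bitmap;
    import android.renderscript.Allocation;
    import android.renderscript.Element;
    import android.renderscript.RenderScript;
    import android.renderscript.ScriptIntrinsicConvolve5x5;

    void runPipeline(Context context, Bitmap inputBitmap, Bitmap outputBitmap,
                     float[] coefficients /* 25 filter taps */) {
        RenderScript rs = RenderScript.create(context);

        // All Allocations are created with USAGE_SHARED (plus USAGE_SCRIPT)
        // so the CPU and GPU can work on the same backing memory.
        Allocation inAlloc = Allocation.createFromBitmap(rs, inputBitmap,
                Allocation.MipmapControl.MIPMAP_NONE,
                Allocation.USAGE_SHARED | Allocation.USAGE_SCRIPT);
        Allocation tmpAlloc = Allocation.createTyped(rs, inAlloc.getType(),
                Allocation.USAGE_SHARED | Allocation.USAGE_SCRIPT);
        Allocation outAlloc = Allocation.createFromBitmap(rs, outputBitmap,
                Allocation.MipmapControl.MIPMAP_NONE,
                Allocation.USAGE_SHARED | Allocation.USAGE_SCRIPT);

        // Preprocessing step: custom script (runs on the GPU in my case).
        ScriptC_preprocess preprocess = new ScriptC_preprocess(rs);
        preprocess.forEach_root(inAlloc, tmpAlloc);

        // 5x5 convolution intrinsic applied to the preprocessed image.
        ScriptIntrinsicConvolve5x5 convolve =
                ScriptIntrinsicConvolve5x5.create(rs, Element.U8_4(rs));
        convolve.setCoefficients(coefficients);
        convolve.setInput(tmpAlloc);
        convolve.forEach(outAlloc);

        // Copying the result back forces the pipeline to complete.
        outAlloc.copyTo(outputBitmap);
    }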
Since I understand that intrinsics run on the CPU, is this behavior expected? Or did I miss anything? Is there a way to make the mixed GPU script / CPU intrinsic code fast? Thanks a lot!
Upvotes: 0
Views: 322
Reputation: 672
The 5x5 convolve intrinsic (in the default Android RenderScript CPU driver) uses NEON. This is extremely fast, and my measurements confirmed it as well. In general, I did not find any RenderScript API that performs a 5x5 convolve on two 5x5 matrices, which is a problem because it prevents one from writing more complex kernels.
Given the performance differences you are noticing, it is quite possible that the GPU driver on your device supports a 5x5 convolve intrinsic that runs slower than the NEON-based CPU 5x5 convolve intrinsic, so forcing RenderScript onto the CPU gives you better performance.
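If you want to double-check where the time goes, keep in mind that forEach() launches asynchronously, so you need to force a sync (for example by copying the result out) before stopping the timer. A rough sketch, reusing the convolve intrinsic and allocations from your setup:

    // Rough timing sketch; forEach() is asynchronous, so copy the result
    // out (or otherwise sync) before reading the clock.
    long start = System.nanoTime();
    convolve.setInput(tmpAlloc);
    convolve.forEach(outAlloc);
    outAlloc.copyTo(outputBitmap);  // forces the convolve to finish
    long elapsedMs = (System.nanoTime() - start) / 1000000;
    Log.d("Convolve5x5", "5x5 convolve took " + elapsedMs + " ms");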
Upvotes: 1