Reputation: 10022
Very simple question. I have access to a multi-node machine and I need to run some NCCL tests. The readme says:
If CUDA is not installed in /usr/local/cuda, you may specify CUDA_HOME. Similarly, if NCCL is not installed in /usr, you may specify NCCL_HOME.
I can see that CUDA is installed, but I am not sure about NCCL, and that is my question. I ran
find /usr -name "libnccl.so*" 2>/dev/null
and it found the library. However, when I ran
find /usr -name "nccl.h" 2>/dev/null
nothing was found. As expected, I could not build even the simplest test program:
#include <stdio.h>
#include <nccl.h>
int main() {
printf("NCCL version: %d\n", NCCL_VERSION_CODE);
return 0;
}
(Btw, I think the OS is CentOS)
Upvotes: 0
Views: 142
Reputation: 7
It is likely you have the runtime:
sudo yum install -y libnccl
But not the development package, which is what provides nccl.h:
sudo yum install -y libnccl-devel
As an alternative, since you used the HPC tag: most HPC clusters tend to provide their software through modules (Environment Modules or Lmod), and those are usually installed outside /usr. You can look with
module avail nccl
If it is there you could load the module and should have access to the development environment.
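Once you know where NCCL lives, building the version check mostly comes down to passing the right include paths. A minimal sketch, assuming the usual include/ layout under each prefix; the helper name nccl_include_flags is made up for illustration, and its defaults mirror the ones quoted from the readme. The CUDA path is included because nccl.h itself includes cuda_runtime.h.

```shell
# Hypothetical helper: print the -I flags needed to compile against nccl.h.
# $1 = NCCL prefix (default /usr), $2 = CUDA prefix (default /usr/local/cuda).
nccl_include_flags() {
    nccl_home=${1:-/usr}
    cuda_home=${2:-/usr/local/cuda}
    # nccl.h pulls in cuda_runtime.h, so both include directories are needed
    printf '%s\n' "-I$nccl_home/include -I$cuda_home/include"
}
```

For example, gcc version_check.c $(nccl_include_flags "$NCCL_HOME" "$CUDA_HOME") -o version_check, where version_check.c is a placeholder name for the small test program from the question. That program only uses a macro, so no -lnccl is needed; anything that calls NCCL functions would also need the library directory and -lnccl.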
As for actually locating it: if NCCL is provided as a module, the previous command will tell you, and you can check the module file to see whether it sets a variable such as NCCL_HOME, which would make things easier.
You can also try ldconfig, which prints all the shared libraries cached by the system. Keep in mind that a miss here can be a false negative: the library may be installed in a directory that is simply not in the linker cache.
ldconfig -p | grep libnccl
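The checks above can be combined into one small search. This is only a sketch: find_nccl_home is a made-up helper name, and it assumes the common layout of include/nccl.h next to lib/ or lib64/ under a single prefix, which may not hold on every system.

```shell
# Hypothetical helper: print the first candidate prefix that contains both
# include/nccl.h and a libnccl.so* under lib64/ or lib/. Returns 1 if none do.
find_nccl_home() {
    for root in "$@"; do
        # The header is what the development package adds; skip roots without it
        [ -f "$root/include/nccl.h" ] || continue
        for libdir in "$root/lib64" "$root/lib"; do
            # ls fails when the glob matches nothing, so this tests for existence
            if ls "$libdir"/libnccl.so* >/dev/null 2>&1; then
                printf '%s\n' "$root"
                return 0
            fi
        done
    done
    return 1
}
```

You could call it as find_nccl_home /usr /usr/local followed by any site-specific prefixes you suspect, and use the printed path as NCCL_HOME. If it prints nothing, the header is most likely missing, which points back at the development package.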
Finally, if nvidia-smi is installed (and in your PATH), running it will confirm the driver version and the CUDA version it supports. As far as I know it does not report the version or location of NCCL, so rely on the checks above for that.
Upvotes: 0