How do I get the PTX in text form when compiling with CuPy for both NVRTC and NVCC backends?
The warp matrix multiply functions are producing awful-looking SASS, so I want to study the PTX itself to see whether the fault lies with the library or with the generated PTX.
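
For reference, one fallback that sidesteps CuPy entirely is to save the kernel source to a file and ask `nvcc` to stop at the PTX stage with its `--ptx` flag. The sketch below only illustrates that workaround and is not CuPy's own mechanism. The helper names `nvcc_ptx_command` and `dump_ptx`, the default `sm_80` architecture, and the assumption that the kernel lives in a standalone `.cu` file are all hypothetical. It also would not necessarily reproduce the exact options CuPy passes to either backend.

```python
import shutil
import subprocess
from pathlib import Path


def nvcc_ptx_command(source_path, arch="sm_80", extra_options=()):
    """Build an nvcc command line that stops after PTX generation.

    Hypothetical helper: `arch` and `extra_options` should be adjusted to
    match whatever the real build uses (they are assumptions here).
    """
    source = Path(source_path)
    ptx_path = source.with_suffix(".ptx")
    cmd = [
        "nvcc",
        "--ptx",            # emit PTX text instead of a cubin or executable
        f"-arch={arch}",
        *extra_options,
        str(source),
        "-o",
        str(ptx_path),
    ]
    return cmd, ptx_path


def dump_ptx(source_path, arch="sm_80", extra_options=()):
    """Run nvcc on the source file and return the generated PTX as text."""
    if shutil.which("nvcc") is None:
        raise RuntimeError("nvcc was not found on PATH")
    cmd, ptx_path = nvcc_ptx_command(source_path, arch, extra_options)
    subprocess.run(cmd, check=True)
    return ptx_path.read_text()
```

Calling `dump_ptx("kernel.cu", arch="sm_75")` on a machine with the CUDA toolkit installed should write `kernel.ptx` next to the source and return its contents, which can then be compared against the SASS.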