user8469759

Reputation: 2732

Solution of a sparse linear system on GPU, paper from Nvidia

I am reading through an Nvidia article about solving sparse linear systems on the GPU. I got stuck on the construction of the chainPtrHost data structure: I understand what it does, but not how it is built. The paper mentions that it depends on the available parallelism across the levels, but it doesn't seem to explain how that is determined.

Does anyone know how it can be constructed?

Paper for reference: Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU

Upvotes: 0

Views: 280

Answers (1)

paleonix

Reputation: 3048

My understanding is that one just iterates through the levels sequentially on the host, as in the following very basic, untested C++ snippet:

#include <vector>

// levels: CSR-style offsets, i.e. level i contains the rows
// [levels[i], levels[i + 1]); the returned chain pointers are 1-based
std::vector<int> create_chains(const std::vector<int> &levels, int threshold) {
    const int num_levels = static_cast<int>(std::size(levels) - 1);

    std::vector<int> chains{};
    chains.reserve(std::size(levels)); // optional optimization to avoid re-allocation

    bool no_open_chain = true;

    for (int idx = 0; idx < num_levels; ++idx) {
        const int num_rows = levels[idx + 1] - levels[idx];
        // a level with more rows than fit in one block becomes its own
        // single-link chain, to be handled by the multi-block kernels
        const bool new_single_link_chain = (num_rows > threshold);

        if (no_open_chain || new_single_link_chain) { // new chain starts here
            chains.push_back(idx + 1); // 1-based index
        }

        no_open_chain = new_single_link_chain;
    }
    chains.push_back(num_levels + 1); // 1-based end of last chain

    return chains;
}

The right value for threshold depends on the implementation of the single-block solver kernel. If each thread operates on one row and there is no restriction on the block size from occupancy, shared memory, etc., it could be 1024, i.e. the maximum number of threads (and therefore rows) per block.

Upvotes: 3
