tglas

Reputation: 1010

Why are non-positive strides disallowed in the blas gemm family of functions?

The netlib documentation of sgemm states that the array strides LDA and LDB must be >= 1, and large enough so that columns don't overlap. Indeed, the implementation in Apple's Accelerate/veclib framework checks these conditions and exits if they are violated.
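For context, in BLAS's column-major convention element (i, j) of an m-by-n matrix with leading dimension lda lives at a[i + j*lda], so an lda smaller than the number of rows would make consecutive columns overlap in memory. A small stand-alone illustration (plain C++, not a BLAS call):

#include <cassert>

int main() {
    // a 3 x 2 matrix stored column-major with leading dimension lda = 5,
    // i.e. each column occupies 5 doubles even though only 3 are used
    const int n = 2, lda = 5;
    double a[lda * n] = {};
    a[2 + 1 * lda] = 7.0;   // element A(2, 1)
    assert(a[7] == 7.0);    // column 1 starts at offset lda == 5
    // with lda = 2 (< 3 rows), A(2, 0) and A(0, 1) would share a[2]
    return 0;
}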

I really don't understand why this limitation exists. Can't BLAS simply believe me that I really want a stride of zero, or a negative stride? As far as I understand, Fortran integers are signed by default, so parameter types don't seem to be the reason (disclaimer: I don't know Fortran very well).

Indeed, there exist very reasonable use cases for non-positive array strides, for example broadcasting a constant value (stride zero) or traversing an array in reverse (negative stride).
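To make the intent concrete, here is the kind of thing a stride-tolerant routine could do (a plain C++ sketch, not a BLAS call): taking element i of a strided vector (x, incx) at x[i*incx], a zero stride broadcasts a single value and a negative stride reads the vector backwards from the given pointer.

#include <cassert>

// sketch only: y[i*incy] += alpha * x[i*incx], with no restriction on the sign of the strides
void axpy_general(int n, double alpha, const double* x, int incx, double* y, int incy) {
    for (int i = 0; i < n; ++i) y[i * incy] += alpha * x[i * incx];
}

int main() {
    double y[3] = {1.0, 2.0, 3.0};

    double one = 1.0;
    axpy_general(3, 1.0, &one, 0, y, 1);    // incx == 0: add the same scalar to every y[i]
    assert(y[0] == 2.0 && y[1] == 3.0 && y[2] == 4.0);

    double x[3] = {10.0, 20.0, 30.0};
    axpy_general(3, 1.0, x + 2, -1, y, 1);  // incx == -1: add x in reverse order
    assert(y[0] == 32.0 && y[1] == 23.0 && y[2] == 14.0);
    return 0;
}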

I am interested in two aspects:

  1. I'd like to understand why the restriction exists. Is it really just because the designers of BLAS did not think about such use cases? Am I the victim of someone trying to catch a bug that's indeed a feature? Or does the restriction result in better performance? I have a hard time imagining the latter.
  2. Is there a way to work around the restriction without sacrificing (too much) performance? For now I need something that works on a Mac in C++, but long term it should still be based on BLAS, so it works across platforms.

Upvotes: 4

Views: 685

Answers (2)

alfC

Reputation: 16262

EDIT 2024:

It turns out that axpy was a bad example. The answer is that more general BLAS functions do not generally accept stride == 0 arrays. See below for details.


ANSWER from 2023 (outdated):

Regarding the case of zero strides: I am using a C++ multidimensional array library (disclaimer: my own) that supports broadcasting and BLAS interfaces. I can try many cases quickly to see what BLAS versions support zero strides.

Here is an example with axpy; on Linux at least, all the versions of BLAS I have available support zero strides (broadcasted arrays).

#include<multi/array.hpp>  // from https://gitlab.com/correaa/boost-multi
#include<multi/adaptors/blas.hpp>

namespace multi = boost::multi;

int main() {
    multi::array<double, 1> y = {1.0, 2.0, 3.0};

    // regular arrays, stride != 0
    multi::array<double, 1> const x = {1.0, 1.0, 1.0};

    multi::blas::axpy(1.0, x.begin(), y);

    assert(y[0] == 2.0);
    assert(y[1] == 3.0);
    assert(y[2] == 4.0);

    // broadcasted array of 0D into 1D, stride == 0
    multi::array<double, 0> const x0(1.0);

    // x0.broadcasted() is {..., 1.0, 1.0, ...}

    multi::blas::axpy(1.0, x0.broadcasted().begin(), y);

    assert(y[0] == 3.0);
    assert(y[1] == 4.0);
    assert(y[2] == 5.0);
}

The program compiles and runs correctly with BLAS (netlib), OpenBLAS, MKL, NVHPC (PGI), ATLAS and BLIS. (Unfortunately, I don't have a Mac device to test Apple's libraries.)

The good news is that none of these libraries rejects zero strides, and they all give the correct answer, at least for axpy.

c++ axpy.cpp -Iinclude/ `pkg-config --libs blas` && ./a.out
c++ axpy.cpp -Iinclude/ `pkg-config --libs openblas` && ./a.out
c++ axpy.cpp -Iinclude/ `pkg-config --libs mkl-static-ilp64-seq` && ./a.out
c++ axpy.cpp -Iinclude/ /opt/nvidia/hpc_sdk/Linux_x86_64/23.9/compilers/lib/libblas.so && ./a.out
c++ axpy.cpp -Iinclude/ `pkg-config --libs blas-atlas` && ./a.out
c++ axpy.cpp -Iinclude/ /usr/lib/x86_64-linux-gnu/blis64-openmp/libblas64.so && ./a.out

EDIT 2024:

It turns out that axpy was a bad example. When I try to do the same but pass a (constant) array with stride 0 to GEMV, all the implementations of BLAS that I have access to print an error message and return zero values in the destination.

$ c++ -std=c++17 -I./include/ include/boost/multi/adaptors/blas/test/gemv.cpp `pkg-config --libs blas` && ./a.out

 ** On entry to DGEMV  parameter number  8 had an illegal value
$ c++ -std=c++17 -I./include/ include/boost/multi/adaptors/blas/test/gemv.cpp `pkg-config --libs openblas` && ./a.out

 ** On entry to DGEMV  parameter number  8 had an illegal value
$ c++ -std=c++17 -I./include/ include/boost/multi/adaptors/blas/test/gemv.cpp `pkg-config --libs blas-atlas` && ./a.out 

 ** On entry to DGEMV  parameter number  8 had an illegal value
$ c++ -std=c++17 -I./include/ include/boost/multi/adaptors/blas/test/gemv.cpp /usr/lib/x86_64-linux-gnu/blis-openmp/libblas.so && ./a.out 

 ** On entry to DGEMV  parameter number  8 had an illegal value
$ c++ -std=c++17 -I./include/ include/boost/multi/adaptors/blas/test/gemv.cpp `pkg-config --libs mkl-static-ilp64-seq` && ./a.out

Intel oneMKL ERROR: Parameter 8 was incorrect on entry to DGEMV .
/opt/nvidia/hpc_sdk/Linux_x86_64/24.7/compilers/bin/nvc++ -std=c++17 -I./include/ include/boost/multi/adaptors/blas/test/gemv.cpp /opt/nvidia/hpc_sdk/Linux_x86_64/24.7/compilers/lib/libblas_lp64.so.0 && ./a.out 

 ** On entry to DGEMV  parameter number  8 had an illegal value

Of course, parameter number 8 is incx, the stride of the input vector. https://www.netlib.org/lapack/explore-html/d7/dda/group__gemv.html
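For reference, the argument checks at the top of netlib's reference dgemv.f amount to roughly the following (a C++ paraphrase, not the actual source). Interestingly, for GEMV only incx == 0 and incy == 0 are rejected; negative increments are allowed. (The cblas_ interface counts the extra order argument first, which is why the Apple error below complains about parameter 9.)

// rough C++ paraphrase of the input checks in the reference dgemv.f;
// info is the 1-based number of the first offending parameter and is what
// XERBLA reports as "parameter number N had an illegal value"
int dgemv_arg_check(char trans, int m, int n, int lda, int incx, int incy) {
    int info = 0;
    if (trans != 'N' && trans != 'T' && trans != 'C') info = 1;
    else if (m < 0)                 info = 2;
    else if (n < 0)                 info = 3;
    else if (lda < (m > 1 ? m : 1)) info = 6;
    else if (incx == 0)             info = 8;   // the error reported above
    else if (incy == 0)             info = 11;
    return info;
}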

Even on an Apple Mac the test fails (hey! at least I learned that on the Mac, BLAS goes through cblas). (It also terminates immediately on the Mac, which I like.)

 BLAS error: Parameter number 9 passed to cblas_dgemv had an invalid value

This is the test code. Note that for the version with stride 1 there is no problem.

    multi::array<double, 2> const a = {
        {1.0, 2.0, 3.0},
        {4.0, 5.0, 6.0}
    };

    {
        multi::array<double, 1> const ones({3}, 1.0);

        assert( ones.stride() == 1 );

        assert( ones[0] == 1.0 );
        assert( ones[1] == 1.0 );
        assert( ones[2] == 1.0 );

        multi::array<double, 1> sum_by_rows({2}, 0.0);
        blas::gemv_n(1.0, a.begin(), 2, ones.begin(), 0.0, sum_by_rows.begin());

        std::cout << sum_by_rows[0] << " " << sum_by_rows[1] << "\n";
        BOOST_TEST( std::abs( sum_by_rows[0] - (1.0 + 2.0 + 3.0)) < 1.0e-8 );
        BOOST_TEST( std::abs( sum_by_rows[1] - (4.0 + 5.0 + 6.0)) < 1.0e-8 );
    }
    // the code below simulates the same operation with an array of stride zero
    {
        multi::array<double, 0> const one(1.0);
        auto const& ones = one.broadcasted();  // BLAS doesn't work with stride zero

        assert( ones.stride() == 0 );

        assert( ones[0] == 1.0 );
        assert( ones[1] == 1.0 );
        assert( ones[2] == 1.0 );

        multi::array<double, 1> sum_by_rows({2}, 0.0);
        blas::gemv_n(1.0, a.begin(), 2, ones.begin(), 0.0, sum_by_rows.begin());

        std::cout << sum_by_rows[0] << " " << sum_by_rows[1] << "\n";
        assert( std::abs( sum_by_rows[0] - (1.0 + 2.0 + 3.0)) < 1.0e-8 );
        assert( std::abs( sum_by_rows[1] - (4.0 + 5.0 + 6.0)) < 1.0e-8 );
    }

I am going to archive the test here: https://gitlab.com/correaa/boost-multi/-/blob/master/include/boost/multi/adaptors/blas/test/gemv.cpp
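If the broadcast behaviour is needed anyway, the obvious (if slightly wasteful) workaround is to materialize the stride-0 vector into a contiguous buffer before the call. A minimal sketch using the plain CBLAS interface rather than the multi adaptors (on a Mac the declarations come from the Accelerate framework instead of cblas.h):

#include <cblas.h>   // on macOS: #include <Accelerate/Accelerate.h>
#include <cassert>
#include <cmath>
#include <vector>

int main() {
    const double a[2 * 3] = {1.0, 2.0, 3.0,   // 2 x 3 matrix, row-major
                             4.0, 5.0, 6.0};

    std::vector<double> const ones(3, 1.0);   // expanded copy of the stride-0 "ones" view
    std::vector<double> sum_by_rows(2, 0.0);

    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3,
                1.0, a, 3, ones.data(), 1,
                0.0, sum_by_rows.data(), 1);

    assert(std::abs(sum_by_rows[0] - (1.0 + 2.0 + 3.0)) < 1e-8);
    assert(std::abs(sum_by_rows[1] - (4.0 + 5.0 + 6.0)) < 1e-8);
    return 0;
}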

Upvotes: 2

alfC

Reputation: 16262

I recently found myself effectively doing this:

double s[4] = {10., 2., 3.1, 4.1};
dscal_(4, 3., s, -1);
assert( s[1] == 2.*3. );

dscal_ is one of the simplest BLAS functions: it multiplies an array by a scalar. Its signature is essentially:

void dscal(int n, double alpha, double* x, int incx); // doesn't seem to return anything useful

In my particular distribution of BLAS (shipped with Fedora 28) this gave a silent error: the function simply doesn't do anything. Moreover, dscal_ doesn't even return an error code, so there is no way I can catch this error without a wrapping function (my arrays have positive or negative strides determined at run time, and I cannot control the value in all cases).
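This silent no-op is consistent with netlib's reference dscal.f, which simply returns early when n or incx is non-positive and never reports an error. Roughly, as a C++ paraphrase (not the actual source):

// rough C++ paraphrase of the reference dscal.f: non-positive n or incx makes the
// routine return immediately, scaling nothing and calling no error handler (XERBLA)
void dscal_reference(int n, double alpha, double* x, int incx) {
    if (n <= 0 || incx <= 0) return;           // the silent early exit observed above
    for (int i = 0; i < n * incx; i += incx)   // forward strided traversal only
        x[i] *= alpha;
}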

All these cases failed:

double s[4] = {10., 2., 3.1, 4.1};
dscal_(4, 3., s, -1); // negative incx does nothing
dscal_(4, 3., &s[4], -1); // negative incx does nothing
dscal_(-4, 3., &s[4], 1); // negative n does nothing
dscal_(-4, 3., &s[4], -1); // negative n or incx does nothing
assert( s[1] == 2. );

This tells me that, although it is probably not documented anywhere, the stride (incx) has to be positive (and so does the size). Fortunately, for many BLAS functions the call can be converted into one with positive strides. I needed a wrapper function to debug this anyway, so I wrote the following one:

void signed_dscal(int n, double alpha, double* x, int incx){
    int prod = incx*n;
    if(prod > 0) dscal(abs(n), alpha, x       , abs(incx));  // signs agree: plain forward call
    else         dscal(abs(n), alpha, x + prod, abs(incx));  // rebase the pointer, then traverse forward
}

In this way, I can call signed_dscal with positive or negative strides and sizes.

int main(){
{
    double d[4] = {10., 2., 3.1, 4.1};
    signed_dscal(4, 3., d, 1);
    assert( d[1] == 6. );
}
{
    double d[4] = {10., 2., 3.1, 4.1};
    signed_dscal(4, 3., &d[4], -1);
    assert( d[1] == 6. );
}
{
    double d[4] = {10., 2., 3.1, 4.1};
    signed_dscal(-4, 3., &d[4], 1);
    assert( d[1] == 6. );
}
{
    double d[4] = {10., 2., 3.1, 4.1};
    signed_dscal(-4, 3., d, -1);
    assert( d[1] == 6. );
}    
    return 0;
}

(Note that incx = 0 is still a case that cannot be worked around this way.)

I still don't understand the logic behind this. Perhaps some implementations of BLAS will let you do this, while others try to protect against infinite loops and, as a side effect, assume positive stride values.


Intel's BLAS documentation seems to imply that negative strides are supported, because it mentions abs(incx), at least for AXPY: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/blas-routines/blas-level-1-routines-and-functions/cblas-axpy.html

I didn't try it, and it remains to be seen whether this applies to other functions.
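For what it's worth, netlib's reference daxpy.f does handle negative increments: when incx < 0 the traversal starts from the far end of the vector, which is presumably what the abs(incx) in the documentation alludes to. Roughly, as a C++ paraphrase (not the actual source):

// rough C++ paraphrase of the general-increment loop in the reference daxpy.f;
// a negative increment starts from the far end, so the pointer still names the
// lowest-addressed element, and incx == 0 just re-reads x[0] (a broadcast)
void daxpy_reference(int n, double a, const double* x, int incx, double* y, int incy) {
    if (n <= 0 || a == 0.0) return;
    int ix = (incx < 0) ? (-n + 1) * incx : 0;   // Fortran: IX = (-N+1)*INCX + 1
    int iy = (incy < 0) ? (-n + 1) * incy : 0;
    for (int i = 0; i < n; ++i) {
        y[iy] += a * x[ix];
        ix += incx;
        iy += incy;
    }
}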

Upvotes: 2
