Sujay_K
Sujay_K

Reputation: 163

How do I store a 100000*100000 matrix in C++?

I have two vectors a[100000] and b[100000]. I want to store a[i]*b[j] in a 100000 x 100000 matrix M. How can I do it in C++?

Upvotes: 0

Views: 3630

Answers (5)

hungptit
hungptit

Reputation: 1444

If you can compute a[i]*b[j] on the fly then you should do that because of two reasons:

  • Obtaining the results from a huge matrix may not be faster than compute the product of two double values on the fly.
  • 10000x10000 double matrix requires 80 Gbytes of storage (in-memory or on disk) and some extra work might be needed to access pre-computed data.

In my example below, I see 30x speed up (compiled in release mode using clang 3.8) if I compute the product of two double values on the fly for N=20000.

template <typename T>
void test_lookup(std::vector<T> &data, std::vector<size_t> &index,
                 std::vector<T> &results) {
    const size_t LOOP = index.size() / 2;
    for (size_t idx = 0; idx < LOOP; ++idx) {
        auto row = index[2 * idx];
        auto col = index[2 * idx + 1];
        results[idx] = data[col * LOOP + row];
    }
}

template <typename T>
void test_mul(std::vector<T> &x, std::vector<T> &y, std::vector<T> &results) {
    for (size_t idx = 0; idx < x.size(); ++idx) {
        results[idx] = x[idx] * y[idx];
    }
}

Upvotes: 0

user3528438
user3528438

Reputation: 2817

    #include <vector>
    class C
    {
    public:
        C(const std::vector<double>& a_, const std::vector<double>& b_)
            :a(a_),b(b_){};
        double operator()(size_t i, size_t j) const { return a[i]*b[j]; }
    private:
         std::vector<double> a, b;
    };

What the problem actually is?

The original question asks about a way to save C(i,j)=A(i)*B(j) to a matrix.

From an OOP point of view, such an matrix can be defined as an object with a method takes two inputs(i and j), and returns a result (ret=A(i)*B(j)).

This could be implemented using nested array subscriptions(c[i][j]), or linear array indexing(c[i*100000+j]), or a function (c.get(i, j)). The third way could also be simplified to a functor (c.operator()(i, j) or c(i, j)).

Then what?

If you agree to all above that any of the three interfaces serves the purpose, or at least partially (like I mentioned in the comment, if the matrix is only required to provide random read access to its elements). Then we continue to implement one of them, 3rd one being my choice.

Why do it that way?

My observation is that, computing the return value is not expensive, so why not calculate the product "lazily" when the product is actually accessed?

In this way, the storage is very efficient (memory usage is reduced from n^2 to 2n).

Hiding the multiplication in the getter function does not significantly increase access time (two memory accesses and one multiplication, compared with the ideal case being one memory access only, but both cases are constant time, and this implementation is much more cache friendly for the reduced size of the data).

So, instead of saving the product, just save the inputs, but calculate the product when a particular element is accessed.

What is missing?

Although manipulating this "matrix" is possible ( by changing the member a and b), it does not allow changing arbitrary element to arbitrary value.

Member functions that implements array slicing (like c(0:10:end, 4)) is not present either, but is feasible.

Test code

int main() {
    C c({1,2,3,4},{10,20,30,40}); // a={1,2,3,4}; b={10,20,30,40}
    cout << "3*30 "<<c(2,2);      // c(2, 2) = a[2]*b[2] = 3*30 = 90
    return 0;
}

Demo

http://ideone.com/bZR7AU

Upvotes: 1

Kijewski
Kijewski

Reputation: 26022

Using an in-RAM std::vector<double> would probably not feasible if you have less than 80GB of RAM in your system (for a 100000×100000 matrix of doubles).

Here is how you would do this using an mmap'd file. Please see the inline comments:

#include <sys/mman.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define ROWS 1000
#define COLS 1000
#define FILENAME "./matrix.doubles"

int main(void)
{
    double (*matrix)[ROWS][COLS]; // pointer to our matrix
    int fd; // file descriptor of backing file

    // open backing file
    fd = open(FILENAME,
              O_CREAT | O_RDWR, // create (if absent) and/or read and writable
              S_IRUSR | S_IWUSR); // (only) user may read and write
    if (fd < 0) {
        perror("Could not open file");
        return 1;
    }

    if ((lseek(fd, sizeof(*matrix), SEEK_SET) == (off_t) -1) ||
        ftruncate(fd, sizeof(*matrix)) ||
        (lseek(fd, 0, SEEK_SET) == (off_t) -1)) {
        perror("Could not set file size.");
        return 1;
    }

    matrix = mmap(NULL, // I don't care were the address starts
                  sizeof(*matrix), // size of matrix in bytes
                  PROT_READ | PROT_WRITE, // readable and writable
                  MAP_PRIVATE, // we access the data exclusively
                  fd, // file descriptor of backing file
                  0); // offset
    if (matrix == MAP_FAILED) {
        perror("Could not mmap file.");
        return 1;
    }

    // operate on matrix
    for (unsigned row = 0; row < ROWS; ++row) {
        for (unsigned col = 0; col < COLS; ++col) {
            (*matrix)[row][col] = row * col;
        }
    }

    // close backing file
    munmap(matrix, sizeof(*matrix));
    close(fd);

    return 0;
}

This is pure C code. You could fancy up the code by using e.g. std::array<double, ROWS, COLS>& instead of a bare array, but I think the idea should be understandable.

Upvotes: 0

Humam Helfawi
Humam Helfawi

Reputation: 20264

The NON-contiguity part of this answer should be re-researched. It may be wrong.

If you want to work with big number of element such as 100000*100000. I would not recommend to use vector of vectors due to the "NON-contiguity" property of the inner vectors elements to each other. Small push_back may results in a lot of mess.

I would use single vector with a wrapper. See this for more information Clean ways to write multiple 'for' loops .

Upvotes: 2

Pierre
Pierre

Reputation: 1174

You can use a std::vector<std::vector<your_type>> to store the result.

int rows = 100000, cols = 100000;

std::vector<std::vector<double>> result;
result.resize(rows);

for(int i=0; i<rows; i++) {
        result[i].resize(cols);
}    

for(int i=0; i<rows; i++) {
    for(int j=0; j<cols; j++){
        result[i][j] = a[i] * b[j];
    }
}

Or you can use a linear algebra library, for example Eigen (you'll may have less to code with this), which will surely be more efficient.

Upvotes: 3

Related Questions