Reputation: 2137
I have a very sparse matrix, with a density of about 0.01, and dimensions 20000 x 500000. I'm trying to load this into Armadillo with
sp_mat V;
V.load(filename, coord_ascii);
The file format is
row column value
But this is taking way too long. Python can parse the file and fill a dictionary with it much faster than Armadillo can create this matrix. How should I do this properly?
The matrix is going to be filled with integers.
Any advice would be appreciated!
This is an issue solely with Armadillo. Plain C++ iterates over the file without issue when it is read line by line, but assigning the values into an arma::sp_mat is extremely slow.
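For reference, the element-wise pattern below is what turns out to be extremely slow (a minimal sketch; the function name and reading loop are just for illustration):
#include <armadillo>
#include <fstream>

using namespace arma;

// Sketch of the slow path: every element access on a sp_mat has to
// update the compressed (CSC) storage, so each assignment is costly.
sp_mat load_slow(const char *filename)
{
    sp_mat V(20000, 500000);
    std::ifstream file(filename);
    uword row, col;
    double value;
    while (file >> row >> col >> value) {
        V(row, col) = value;  // per-write CSC bookkeeping
    }
    return V;
}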
Upvotes: 2
Views: 3589
Reputation: 2347
Today I encountered this very same problem when trying to load a 100 MB CSV file using Armadillo's .load(). It's just too slow.
Since @Enrico Borba answered that he does his own file reading using std::ifstream and the result is pretty amazing, here is my own code to load a CSV file into Armadillo's mat type, also using std::ifstream.
For example, if you try to do this, it will take a very long time to load the file:
arma::mat A;
A.load("file.csv", arma::csv_ascii);
So here is an alternative, which is a thousand times faster than the code above:
#include <armadillo>
#include <fstream>
#include <string>
#include <vector>

arma::mat readCSV(const std::string &filename, const std::string &delimiter = ",")
{
    std::ifstream csv(filename);
    std::vector<std::vector<double>> rows;

    for (std::string line; std::getline(csv, line); ) {
        std::vector<double> row;

        // split the line on the delimiter and parse each field
        std::string::size_type start = 0;
        auto end = line.find(delimiter);
        while (end != std::string::npos) {
            row.push_back(std::stod(line.substr(start, end - start)));
            start = end + delimiter.length();
            end = line.find(delimiter, start);
        }
        row.push_back(std::stod(line.substr(start)));  // last field

        rows.push_back(row);
    }

    // copy the parsed rows into an Armadillo matrix
    arma::mat data_mat = arma::zeros<arma::mat>(rows.size(), rows[0].size());
    for (std::size_t i = 0; i < rows.size(); i++) {
        data_mat.row(i) = arma::conv_to<arma::rowvec>::from(rows[i]);
    }
    return data_mat;
}
Then you can use it as a drop-in replacement:
arma::mat A = readCSV("file.csv");
Upvotes: 1
Reputation: 2137
The Armadillo documentation states:
"Using batch insertion constructors is generally much faster than consecutively inserting values using element access operators"
So here is the best I could come up with:
#include <armadillo>
#include <fstream>
#include <vector>

using namespace arma;

sp_mat get(const char *filename)
{
    std::vector<uword> location_u;   // row indices
    std::vector<uword> location_m;   // column indices
    std::vector<double> values;

    std::ifstream file(filename);
    uword a, b;
    double c;
    while (file >> a >> b >> c) {
        location_u.push_back(a);
        location_m.push_back(b);
        values.push_back(c);
    }

    // batch insertion: build a 2 x N matrix of locations, then
    // construct the sparse matrix in a single pass
    umat lu(location_u);
    umat lm(location_m);
    umat location(join_rows(lu, lm).t());
    return sp_mat(location, vec(values));
}
It now runs at a reasonable speed, about 1 million lines per second.
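If your file happens to be sorted by column (and contains no explicit zeros), my reading of the documented constructor overloads suggests the sorting and zero checks can be skipped, and passing the size explicitly avoids deriving it from the indices. This variant is an assumption worth verifying against your Armadillo version:
// Assumed overload: sp_mat(locations, values, n_rows, n_cols,
//                          sort_locations, check_for_zeros)
sp_mat V(location, vec(values), 20000, 500000, false, false);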
Upvotes: 5