niro
niro

Reputation: 977

Memory optimization in huge data set

Deal all, I have implemented some functions and like to ask some basic thing as I do not have a sound fundamental knowledge on C++. I hope, you all would be kind enough to tell me what should be the good way as I can learn from you. (Please, this is not a homework and i donot have any experts arround me to ask this)

What I did is; I read the input x,y,z, point data (around 3GB data set) from a file and then compute one single value for each point and store inside a vector (result). Then, it will be used in next loop. And then, that vector will not be used anymore and I need to get that memory as it contains huge data set. I think I can do this in two ways. (1) By just initializing a vector and later by erasing it (see code-1). (2) By allocating a dynamic memory and then later de-allocating it (see code-2). I heard this de-allocation is inefficient as de-allocation again cost memory or maybe I misunderstood.

Q1) I would like to know what would be the optimized way in terms of memory and efficiency.

Q2) Also, I would like to know whether function return by reference is a good way of giving output. (Please look at code-3)

code-1

int main(){

    //read input data (my_data)

    vector<double) result;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){

         // do some stuff and calculate a "double" value (say value)
         //using each point coordinate 

         result.push_back(value);

    // do some other stuff

    //loop over result and use each value for some other stuff
    for (int i=0; i<result.size(); i++){

        //do some stuff
    }

    //result will not be used anymore and thus erase data
    result.clear()

code-2

int main(){

    //read input data

    vector<double) *result = new vector<double>;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){

         // do some stuff and calculate a "double" value (say value)
         //using each point coordinate 

         result->push_back(value);

    // do some other stuff

    //loop over result and use each value for some other stuff
    for (int i=0; i<result->size(); i++){

        //do some stuff
    }

    //de-allocate memory
    delete result;
    result = 0;
}

code03

vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const
{
  vector<Position3D> *points_at_grid_cutting = new vector<Position3D>;
  vector<Position3D>::iterator  point;

  for (point=begin(); point!=end(); point++) {

       //do some stuff         

  }
  return (*points_at_grid_cutting);
}

Upvotes: 0

Views: 419

Answers (6)

Darcara
Darcara

Reputation: 1618

code-1 will work fine and is almost the same as code-2, with no major advantages or disadvantages.

code03 Somebody else should answer that but i believe the difference between a pointer and a reference in this case would be marginal, I do prefer pointers though.

That being said, I think you might be approaching the optimization from the wrong angle. Do you really need all points to compute the output of a point in your first loop? Or can you rewrite your algorithm to read only one point, compute the value as you would in your first loop and then use it immediately the way you want to? Maybe not with single Points, but with batches of points. That could potentially cut back on your memory require quite a bit with only a small increase in processing time.

Upvotes: -1

Steve Jessop
Steve Jessop

Reputation: 279345

erase will not free the memory used for the vector. It reduces the size but not the capacity, so the vector still holds enough memory for all those doubles.

The best way to make the memory available again is like your code-1, but let the vector go out of scope:

int main() {
    {
        vector<double> result;
        // populate result
        // use results for something
    }
    // do something else - the memory for the vector has been freed
}

Failing that, the idiomatic way to clear a vector and free the memory is:

vector<double>().swap(result);

This creates an empty temporary vector, then it exchanges the contents of that with result (so result is empty and has a small capacity, while the temporary has all the data and the large capacity). Finally, it destroys the temporary, taking the large buffer with it.

Regarding code03: it's not good style to return a dynamically-allocated object by reference, since it doesn't provide the caller with much of a reminder that they are responsible for freeing it. Often the best thing to do is return a local variable by value:

vector<Position3D> ReturnLabel(VoxelGrid grid, int segment) const
{
  vector<Position3D> points_at_grid_cutting;
  // do whatever to populate the vector
  return points_at_grid_cutting;
}

The reason is that provided the caller uses a call to this function as the initialization for their own vector, then something called "named return value optimization" kicks in, and ensures that although you're returning by value, no copy of the value is made.

A compiler that doesn't implement NRVO is a bad compiler, and will probably have all sorts of other surprising performance failures, but there are some cases where NRVO doesn't apply - most importantly when the value is assigned to a variable by the caller instead of used in initialization. There are three fixes for this:

1) C++11 introduces move semantics, which basically sort it out by ensuring that assignment from a temporary is cheap.

2) In C++03, the caller can play a trick called "swaptimization". Instead of:

vector<Position3D> foo;
// some other use of foo
foo = ReturnLabel();

write:

vector<Position3D> foo;
// some other use of foo
ReturnLabel().swap(foo);

3) You write a function with a more complicated signature, such as taking a vector by non-const reference and filling the values into that, or taking an OutputIterator as a template parameter. The latter also provides the caller with more flexibility, since they need not use a vector to store the results, they could use some other container, or even process them one at a time without storing the whole lot at once.

Upvotes: 1

Sebastian Mach
Sebastian Mach

Reputation: 39119

`Destructors in C++ must not fail, therefore deallocation does not allocate memory, because memory can't be allocated with the no-throw guarantee.

Apart: Instead of looping multiple times, it is probably better if you do the operations in an integrated manner, i.e. instead of loading the whole dataset, then reducing the whole dataset, just read in the points one by one, and apply the reduction directly, i.e. instead of

load_my_data()
for_each (p : my_data)
    result.push_back(p)

for_each (p : result)
    reduction.push_back (reduce (p))

Just do

file f ("file")
while (f)
    Point p = read_point (f)
    reduction.push_back (reduce (p))

If you don't need to store those reductions, simply output them sequentially

file f ("file")
while (f)
    Point p = read_point (f)
    cout << reduce (p)

Upvotes: 0

FailedDev
FailedDev

Reputation: 26940

vector<double) result;
    for (vector<Position3D>::iterator it=my_data.begin(); it!=my_data.end(); it++){

         // do some stuff and calculate a "double" value (say value)
         //using each point coordinate 

         result.push_back(value);

If the "result" vector will end up having thousands of values, this will result in many reallocations. It would be best if you initialize it with a large enough capacity to store, or use the reserve function :

vector<double) result (someSuitableNumber,0.0);

This will reduce the number of reallocation, and possible optimize your code further.

Also I would write : vector<Position3D>& vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment) const

Like this :

void vector<Position3D>::ReturnLabel(VoxelGrid grid, int segment, vector<Position3D> & myVec_out) const //myVec_out is populated inside func

Your idea of returning a reference is correct, since you want to avoid copying.

Upvotes: 0

thiton
thiton

Reputation: 36059

Your code seems like the computed value from the first loop is only used context-insensitively in the second loop. In other words, once you have computed the double value in the first loop, you could act immediately on it, without any need to store all values at once.

If that's the case, you should implement it that way. No worries about large allocations, storage or anything. Better cache performance. Happiness.

Upvotes: 1

Kyrylo Kovalenko
Kyrylo Kovalenko

Reputation: 2171

For such huge data sets I would avoid using std containers at all and make use of memory mapped files.

If you prefer to go on with std::vector, use vector::clear() or vector::swap(std::vector()) to free memory allocated.

Upvotes: 2

Related Questions