Reputation: 217
I am writing code to sort a large chunk of strings (~2 GB) and I am using a method similar to bucket sort. I have about 47000 vectors and each of them will hold ~100 elements (char*) on average. (Due to the input, some of them could have a lot of elements and some could be empty.)
My code for getting input is something like this:
#include <iostream>
#include <string>
#include <vector>
#include <cstring>
#include <cstdio>   // scanf
#include <cstdlib>  // calloc
using namespace std;

#define digitize(a) ((a>47 && a<58)*(a-48) + (a>96 && a<123)*(a-87)) // base-36 digit value of a character

int main() {
    int n = 0;
    scanf("%d\n", &n);
    vector<char *> buckets[46657]; // 36 * 36 * 36 = 46656
    for (int i = 0; i < n; i++) {
        char *temp = static_cast<char *>(calloc(255, sizeof(char)));
        scanf("%s\n", temp);
        int d1 = temp[0];
        int d2 = temp[1];
        int d3 = temp[2];
        int index = digitize(d1) * 1296 + digitize(d2) * 36 + digitize(d3); // 1296 = 36*36
        buckets[index].push_back(temp);
    }
}
digitize is actually a base-36 converter (i.e. 0:0, 1:1, ..., a:10, b:11, ..., z:35), because my data consists only of digits and lowercase letters.
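(For example, with this mapping a string starting with "ab3" goes into bucket 10 * 1296 + 11 * 36 + 3 = 13359.)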
Running this code on a 500 MB dataset (which is randomly generated), the RAM usage of the process goes above 4 GB, close to 5 GB. (OS: Windows 7 64-bit, IDE: JetBrains CLion, Compiler: g++)
Is that normal? I don't understand why it uses such a huge amount of RAM just for reading the data. I also checked the output by adding another loop and it was correct, so there is no infinite loop or anything like that. The code works fine but uses a huge amount of RAM.
Any idea why it uses this much RAM?
Upvotes: 1
Views: 377
Reputation: 73465
You create an array of 46657 vector<char*>. Each vector holds on average 100 char*, each pointing to a newly allocated string of 255 bytes. That's at least sizeof(buckets) + 46657*100*(sizeof(char*)+255) bytes. Depending on the implementation, this could be around 1.2 GB. The vectors may also hold some spare capacity reserved for faster growth, but that will not change the order of magnitude fundamentally.
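To put numbers on it: with 8-byte pointers that is roughly 46657 * 100 * (8 + 255) ≈ 1.23e9 bytes for the pointers and the 255-byte buffers, plus about 46657 * 24 ≈ 1.1 MB for the array of vectors itself (assuming a typical 24-byte vector header).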
All this is huge, perhaps more than you need, but it's far from the 5 GB that you measured. And what did you measure in the first place?
The memory consumption statistics that you looked at are most probably reported at the operating-system level: the memory consumed by the process executing your code, not necessarily the memory that your code itself consumes. All this is implementation dependent, but in general the standard library may allocate very big memory chunks from the OS, because calls to the OS are more expensive than local calls in user space. This big chunk is then cut into pieces at each new or malloc call. It's like a wholesaler and a retailer.
To know the memory that your code actually consumes, you need to do more precise memory monitoring/profiling. For example you could use valgrind or other tools, depending on your OS.
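On Linux, for instance, a heap profile can be taken with valgrind's massif tool (the program and input names here are just placeholders):

valgrind --tool=massif ./bucketsort < input.txt
ms_print massif.out.<pid>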
We do not see the full code, so leaks are not excluded either. Since you manage memory manually, the risk of leaking is higher. Not to mention the unsanitized scanf("%s", ...), which could read more than the 255 allocated chars and corrupt memory.
A safer approach could therefore be:
vector<string> buckets[46657];
...
for (...) {
    string temp;        // let the string take care of its own memory
    getline(cin, temp);
    ...
}
As a side effect, you also benefit from the library's optimized memory management (e.g. the small string optimization) if you have many short strings.
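For illustration, here is a minimal sketch of that approach (not your exact program): it assumes, as your code does, that every line starts with at least three characters from 0-9/a-z, and digit36 is just a stand-in for your digitize macro.

#include <algorithm>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

int main() {
    int n = 0;
    cin >> n;
    cin.ignore();                               // skip the rest of the first line

    // same character mapping as the digitize macro, assuming only 0-9 and a-z
    auto digit36 = [](unsigned char c) { return c <= '9' ? c - '0' : c - 'a' + 10; };

    vector<string> buckets[46657];              // 36 * 36 * 36 = 46656
    string temp;
    for (int i = 0; i < n && getline(cin, temp); ++i) {
        int index = digit36(temp[0]) * 1296 + digit36(temp[1]) * 36 + digit36(temp[2]);
        buckets[index].push_back(move(temp));   // each string manages its own buffer
    }

    for (auto &b : buckets)                     // finish the bucket sort
        sort(b.begin(), b.end());
}

Since the bucket index already orders the strings by their first three characters, sorting each bucket and then walking the buckets in index order gives the overall sorted sequence.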
Upvotes: 4