user4673

Reputation: 153

How much RAM does data take up?

How does one determine (aside from trial and error) the amount of RAM required to store one's dataset?

I know this is a super general question, so hopefully this example can narrow down what I am trying to understand:

I have a data file containing characters [A-Z] and numbers (no special symbols). I want to read the data into RAM (using Python), then store it in a dictionary. I have a lot of data and a computer with only 2 GB of RAM, so I'd like to know ahead of time whether the data would fit into RAM, as this could change the way I load the file with Python and handle the data downstream. I recognize that the data may not all fit into RAM, but that's another problem; I just want to know how much RAM the data would take up and what I need to consider to make this determination.

So, knowing the content of my file, its initial size, and the downstream data structure I want to use, how can I figure out the amount of RAM the data will take up?

Upvotes: 2

Views: 3235

Answers (1)

abarnert

Reputation: 365597

The best thing to do here is not to try to guess, or to read the source code and write up a rigorous proof, but to do some tests. There are a lot of complexities that make these things hard to predict. For example, if you have 100K copies of the same string, will Python store 100K copies of the actual string data, or just 1? It depends on your Python interpreter and version, and all kinds of other things.
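A quick illustration of why this is hard to predict on paper (a sketch; the exact behavior varies by interpreter and version):

```python
import sys

big = 'ABC123' * 1000                    # a 6000-character string

shared = [big] * 1000                    # 1000 references to ONE string object
distinct = ['ABC123' * 1000 for _ in range(1000)]  # typically 1000 separate copies

print(shared[0] is shared[1])            # True: the same object, repeated
print(distinct[0] is distinct[1])        # typically False on CPython

# sys.getsizeof is shallow: both lists report a similar size (a few KB)
# even though distinct holds roughly 1000x more string data.
print(sys.getsizeof(shared), sys.getsizeof(distinct))
```

The two lists have wildly different real footprints, but nothing about the code that built them makes that obvious, which is why measuring beats guessing.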

The documentation for sys.getsizeof has a link to a recursive sizeof recipe. And that's exactly what you need to measure how much storage your data structure is using.
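A minimal sketch in the spirit of that recipe, walking containers recursively and counting shared objects only once (the handler table here is trimmed down; the real recipe covers more cases):

```python
import sys
from itertools import chain
from collections import deque

def total_size(obj):
    """Rough total memory footprint of obj and everything it references."""
    # How to iterate over the contents of the common container types.
    handlers = {
        tuple: iter,
        list: iter,
        deque: iter,
        set: iter,
        frozenset: iter,
        dict: lambda d: chain.from_iterable(d.items()),
    }
    seen = set()  # ids of objects already counted, so shared data counts once

    def sizeof(o):
        if id(o) in seen:
            return 0
        seen.add(id(o))
        size = sys.getsizeof(o)
        for typ, iterate in handlers.items():
            if isinstance(o, typ):
                size += sum(map(sizeof, iterate(o)))
                break
        return size

    return sizeof(obj)

print(total_size({'ABC123': [1, 2, 3], 'DEF456': [4, 5, 6]}))
```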

So load in, say, the first 1% of your data and see how much memory it uses. Then load in 5% and make sure it's about 5x as big. If so, you can guess that your full data will be 20x as big again.
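One self-contained way to run that experiment is with the stdlib tracemalloc module, which counts the allocations made while building the dict (the file name and the one-record-per-line "KEY VALUE" format below are hypothetical placeholders for whatever your data actually looks like):

```python
import tracemalloc

def measure_fraction(path, fraction):
    """Build the dict from the first `fraction` of the file's lines and
    return roughly how many bytes were allocated doing it."""
    with open(path) as f:
        lines = f.readlines()
    keep = lines[:int(len(lines) * fraction)]

    tracemalloc.start()
    data = {}
    for line in keep:
        key, value = line.split(None, 1)   # hypothetical record format
        data[key] = value.strip()
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return current

size_1 = measure_fraction('data.txt', 0.01)   # hypothetical file name
size_5 = measure_fraction('data.txt', 0.05)
print(size_5 / size_1)   # should come out near 5 if growth is linear
print(size_5 * 20)       # then this is a fair estimate for the whole file
```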

(Obviously this doesn't work for all conceivable data—there are some objects that have more cross-links the farther you get into the file, others—like numbers—that might just get larger, etc. But it will work for a lot of realistic kinds of data. And if you're really worried, you can always test the last 5% against the first 5% and see how they differ, right?)

You can also test at a higher level by using modules like Heapy, or completely externally by just watching with Process Manager/Activity Monitor/etc., to double-check the results. One thing to keep in mind is that many of these external measures will show you the peak memory usage of your program, not the current memory usage. And it's not even clear what you'd want to call "current memory usage" anyway. (Python rarely releases memory back to the OS. If it leaves memory unused, it will likely get paged out of physical memory by the OS, but the VM size won't come down. Does that count as in-use to you, or not?)
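For a quick in-process version of that external check on Unix, the stdlib resource module exposes the same peak figure the OS tools show; note it is exactly the high-water mark described above, not current usage:

```python
import resource
import sys

# Peak resident set size of the current process (Unix-only; on Windows
# you would need something third-party like psutil instead).
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
if sys.platform == 'darwin':
    peak //= 1024
print('peak memory: %d KiB' % peak)
```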

Upvotes: 4
