plam

Reputation: 1375

Memory usage/efficiency for pandas dataframe versus lists versus tuples, etc.

I'm trying to create a class in Python that ends up storing some text documents along with some metadata for each of the documents. Think of a structure like this:

ID    Text                        Date       Followers
1     "This is a tweet"           10/21/14   57
2     "This is another tweet"     10/22/14   100
3     "Yet another"               10/23/14   3899 
4     "Another one"               10/25/14   234

What's the best and most memory-efficient way to store stuff like this? Is it as four different lists (for example)? Or maybe a dictionary and/or tuples? Or as a pandas DataFrame?

Are there significant differences between each one? I would like to store them as a pandas DataFrame just for ease of working with the data, but I also want to be mindful of memory usage and speed for larger datasets.
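One way to get a rough feel for the differences yourself is to measure the layouts with the standard library. The sketch below (using only `sys.getsizeof`, with a small helper to count a container's elements as well) compares a column-oriented layout of parallel lists against a row-oriented list of dicts; the exact byte counts will vary by Python version and platform:

```python
import sys

def deep_size(obj):
    """Rough recursive size: the container itself plus its immediate contents."""
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k) + deep_size(v) for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_size(x) for x in obj)
    return size

texts = ["This is a tweet", "This is another tweet", "Yet another", "Another one"]
dates = ["10/21/14", "10/22/14", "10/23/14", "10/25/14"]
followers = [57, 100, 3899, 234]

# Column-oriented: parallel lists, one per field
col_size = sum(deep_size(c) for c in (texts, dates, followers))

# Row-oriented: one dict per record (the key strings are counted per row)
rows = [
    {"Text": t, "Date": d, "Followers": f}
    for t, d, f in zip(texts, dates, followers)
]
row_size = deep_size(rows)

print("columns:", col_size, "bytes; rows:", row_size, "bytes")
```

The row-oriented layout pays per-record dict overhead, so parallel lists come out noticeably smaller even at four rows; the gap grows with the number of records.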

Upvotes: 3

Views: 2953

Answers (1)

JD Long

Reputation: 60756

Your question is really too broad to answer simply. However, I can share a few thoughts.

I tend to think of my data in only three buckets as it relates to size:

  1. Fits in memory on one machine
  2. Fits on disk on one machine but not in memory
  3. Too big for one machine

We can spend forever talking about which framework or data structure we should use for each of these three buckets. However, I've found that for 90% of my analytical work the choice is simple:

  1. NumPy array or pandas DataFrame
  2. PyTables
  3. Hadoop or a distributed database
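For the first bucket, pandas can tell you directly how much RAM a frame occupies, which is usually enough to decide whether you're still in bucket 1. A minimal sketch, assuming pandas is installed; `deep=True` measures the actual Python string objects behind `object` columns rather than just the 8-byte pointers to them:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Text": ["This is a tweet", "This is another tweet",
             "Yet another", "Another one"],
    "Date": pd.to_datetime(["10/21/14", "10/22/14", "10/23/14", "10/25/14"]),
    "Followers": [57, 100, 3899, 234],
})

# Per-column memory in bytes; deep=True inspects the string payloads too
usage = df.memory_usage(deep=True)
print(usage)
print("total bytes:", usage.sum())
```

Running this on your real data (or a representative sample, then scaling up) gives a much better answer than reasoning about the structures in the abstract.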

I only look for a data structure other than the above if I have a compelling reason.

I hope that helps a bit.

Upvotes: 5
