Rithwik
Rithwik

Reputation: 1198

Pandas dataframe from list of dictionaries or dictionary of lists? Efficiency

I have a simulation that generates data every step. I store the data in memory. After a certain number of steps, I create a pandas dataframe from the stored data and write(append) to file using df.to_csv. I can create a dataframe from a list of dictionaries or a dictionary of lists (where values are lists). Which of these would give me better performance and memory management?

Option A:

data = []
d={'a':1, 'b': 2} # Data from one step
data.append(d)
d={'a':2, 'b': 3} # Data from another step
data.append(d)
# data = [{'a':1, 'b':2}, {'a':2, 'b':3}]
df = pd.DataFrame(data)
with open(output_file, 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell()==0) 
    # Header added at first write

OR Option B:

data = {'a':[], 'b':[]}
data['a'].append(1) # Data from one step
data['b'].append(2)
data['a'].append(2) # Data from another step
data['b'].append(3)
# data = {'a':[1,2], 'b':[2,3]}
df = pd.DataFrame(data)
with open(output_file, 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell()==0) 
    # Header added at first write

I have 10^5 to 10^8 steps of data. I want to break down the writing into parts. That is,

  1. Store data for n steps in memory
  2. Create dataframe and write data from n steps
  3. Clear past n steps data from memory and repeat 1 and 2 for next n steps and so on...

So you can essentially imagine the above snippets as being in loops. I want the memory to be freed after every n-step of data. I am assuming re-declaring the data variable will achieve this, in both cases. If not, please advice how I can get this done so that memory usage is not cumulatively increasing.

Upvotes: 2

Views: 1638

Answers (1)

fthomson
fthomson

Reputation: 789

This seems to cover row oriented vs column oriented databases. Here is a really good answer covering the differences between the two. List with many dictionaries VS dictionary with few lists?

In summary the conversion is more expensive when assigning a row oriented structure as every individual dictionary must be read. However improvements would be marginal as best.

Upvotes: 3

Related Questions