Reputation: 1198
I have a simulation that generates data every step. I store the data in memory.
After a certain number of steps, I create a pandas DataFrame from the stored data and write (append) it to a file using df.to_csv.
I can create the DataFrame either from a list of dictionaries or from a dictionary of lists (one list of values per column). Which of these would give me better performance and memory management?
Option A:
import pandas as pd

data = []
d = {'a': 1, 'b': 2}  # Data from one step
data.append(d)
d = {'a': 2, 'b': 3}  # Data from another step
data.append(d)
# data = [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}]
df = pd.DataFrame(data)
with open(output_file, 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell() == 0)  # Header added at first write
OR Option B:
import pandas as pd

data = {'a': [], 'b': []}
data['a'].append(1)  # Data from one step
data['b'].append(2)
data['a'].append(2)  # Data from another step
data['b'].append(3)
# data = {'a': [1, 2], 'b': [2, 3]}
df = pd.DataFrame(data)
with open(output_file, 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell() == 0)  # Header added at first write
I have 10^5 to 10^8 steps of data. I want to break down the writing into parts. That is:
1. Store n steps of data in memory.
2. Write those n steps to file.
3. Free the n steps of data from memory, then repeat 1 and 2 for the next n steps.
And so on... So you can essentially imagine the above snippets as being inside a loop. I want the memory to be freed after every n steps of data. I am assuming that re-declaring the data variable will achieve this in both cases. If not, please advise how I can get this done so that memory usage does not keep growing cumulatively.
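For illustration, a minimal sketch of what such a chunked loop could look like with Option A (simulate_step, total_steps, n and output_file are hypothetical placeholders, not part of the original post); rebinding data to a fresh list after each write lets the previous chunk be garbage-collected:

import pandas as pd

def simulate_step(step):
    # Hypothetical stand-in for the real simulation: returns one step's data as a dict
    return {'a': step, 'b': step + 1}

total_steps = 10**6   # Anywhere from 10^5 to 10^8 in practice
n = 10_000            # Steps to buffer before each write
output_file = 'results.csv'

data = []
for step in range(total_steps):
    data.append(simulate_step(step))
    if len(data) == n:
        df = pd.DataFrame(data)
        with open(output_file, 'a') as f:
            df.to_csv(f, sep=",", index=False, header=f.tell() == 0)
        data = []  # Rebind to a new list; the old chunk can now be garbage-collected

if data:  # Flush any leftover steps after the loop
    df = pd.DataFrame(data)
    with open(output_file, 'a') as f:
        df.to_csv(f, sep=",", index=False, header=f.tell() == 0)

The same pattern works for Option B by resetting data = {'a': [], 'b': []} after each write.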
Upvotes: 2
Views: 1638
Reputation: 789
This comes down to row-oriented vs column-oriented storage. Here is a really good answer covering the differences between the two: List with many dictionaries VS dictionary with few lists?
In summary, the conversion is more expensive when you start from the row-oriented structure (the list of dictionaries), since every individual dictionary must be read. However, any improvement would be marginal at best.
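If you want to measure it for your own chunk size, here is a rough sketch using timeit (the column names and row count are arbitrary, not taken from the question):

import timeit
import pandas as pd

n = 100_000  # Rows per chunk (arbitrary)

rows = [{'a': i, 'b': i + 1} for i in range(n)]           # Option A: list of dicts
cols = {'a': list(range(n)), 'b': list(range(1, n + 1))}  # Option B: dict of lists

t_rows = timeit.timeit(lambda: pd.DataFrame(rows), number=10)
t_cols = timeit.timeit(lambda: pd.DataFrame(cols), number=10)
print(f"list of dicts: {t_rows:.3f} s")
print(f"dict of lists: {t_cols:.3f} s")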
Upvotes: 3