Reputation: 39
I am writing a script which adds my simulated data to a pandas dataframe for n simulations in my loop. When I choose a value of n greater than about 15 it crashes; I think my df becomes too big to hold in memory while the simulations are running.
I create an empty DF:
import pandas as pd

df = pd.DataFrame(
    {'gamma': [],
     'alpha': [],
     'focus': [],
     'X': [],
     'Y': [],
     'Z': [],
     'XX': [],
     'img1': [],
     'img2': [],
     'num_samples': [],
     'resolution': []})
for i in range(n):
    # some simulation stuff
and then populate my data frame with the values
df.loc[i] = {
    'gamma': gamma,
    'alpha': alpha,
    'focus': focus,
    'X': X,
    'Y': Y,
    'Z': Z,
    'XX': XX,
    'img1': img1,
    'img2': img2,
    'num_samples': num_samples,
    'resolution': resolution}
I run through this n times to populate my df and then save it. However, it keeps crashing. I thought that dask.dataframe might be a good fit here:
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame(
    {'gamma': [],
     'alpha': [],
     'focus': [],
     'X': [],
     'Y': [],
     'Z': [],
     'XX': [],
     'img1': [],
     'img2': [],
     'num_samples': [],
     'resolution': []}), chunksize=10)
and then populate my data
df.loc[i] = {
    'probe': wave.array.real,
    'gamma': gamma,
    'alpha': alpha,
    'focus': focus,
    'X': X,
    'Y': Y,
    'Z': Z,
    'XX': X,
    'img1': img1,
    'img2': img2,
    'num_samples': num_samples,
    'resolution': resolution}
But I get the error: '_LocIndexer' object does not support item assignment
I have considered creating the pd.DataFrame within the loop and saving it for each simulated value, but this seems inefficient and I think I should be able to do it within dask.
Any suggestions?
I'm running Windows, 20 GB RAM, SSD, i7, if that helps.
Upvotes: 0
Views: 1148
Reputation: 28684
As the error message suggests, Dask does not generally allow you to alter the contents of a dataframe in-place. Furthermore, it is really unusual to try to append or otherwise change the size of a dask dataframe once created. Since you are running out of memory, Dask is still your tool of choice, so here is something that might be the easiest way to do it, keeping close to your original code.
meta = pd.DataFrame(
    {'gamma': [],
     'alpha': [],
     'focus': [],
     'X': [],
     'Y': [],
     'Z': [],
     'XX': [],
     'img1': [],
     'img2': [],
     'num_samples': [],
     'resolution': []})
def chunk_to_data(df):
    out = meta.copy()
    for i in df.i:
        out.loc[i] = {...}  # fill in the simulated values for index i
    return out
# the pandas dataframe will be small enough to fit in memory
d = dd.from_pandas(pd.DataFrame({'i': range(n)}), chunksize=10)
d.map_partitions(chunk_to_data, meta=meta)
This makes a lazy prescription, so when you come to process across your indices, you run one chunk at a time (per thread - be sure not to use too many threads).
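To actually run it, trigger the computation and write the output to disk chunk by chunk; a minimal sketch (the output path is just a placeholder):
result = d.map_partitions(chunk_to_data, meta=meta)
result.to_parquet('simulation_results')  # placeholder path; or result.compute() if the final frame fits in memory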
In general, it may have been better to use dask.delayed
with a function taking start-i and end-i to make a dataframe for each piece without needing an input dataframe, and then dd.from_delayed
to construct the dask dataframe; a rough sketch follows.
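Something along these lines (a sketch only; run_simulation is a placeholder for your simulation code, and the chunk size of 10 is an assumption):
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def make_chunk(start, end):
    # build one small pandas dataframe per chunk of simulations
    rows = [run_simulation(i) for i in range(start, end)]  # run_simulation() is a placeholder returning a dict of results
    return pd.DataFrame(rows)

chunks = [make_chunk(s, min(s + 10, n)) for s in range(0, n, 10)]
ddf = dd.from_delayed(chunks, meta=meta)
ddf.to_parquet('simulation_results')  # placeholder path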
Upvotes: 2
Reputation: 13447
Please take this as a comment with some code rather than an answer. I'm not sure that what you're doing is the best way to work with pandas. Since you want to store results, it is probably better to save them in chunks, and then read them back with dask when you want to do some analysis. I would split the simulations so that each batch of results fits in memory and save each batch to disk. Say only 100 simulations fit in memory; then for every chunk of 100 you can do:
import pandas as pd

cols = ['gamma', 'alpha', 'focus', 'X',
        'Y', 'Z', 'XX', 'img1', 'img2',
        'num_samples', 'resolution']

mem = 100
out = []
for i in range(mem):
    # simulation returns list results
    out.append(results)

df = pd.DataFrame(out, columns=cols)
df.to_csv('results/filename01.csv')
# or even better
df.to_parquet('results/filename01.parq')
Finally, you can run this block in parallel with dask
or multiprocessing
(which one is best mostly depends on the simulation: is it single-threaded or not?). A rough multiprocessing sketch follows.
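A minimal sketch with multiprocessing, assuming a hypothetical simulate(i) that returns one list of results and reusing cols and mem from the block above:
from multiprocessing import Pool

import pandas as pd

def run_chunk(chunk_id):
    # run `mem` simulations and write one file per chunk
    out = [simulate(i) for i in range(mem)]  # simulate() is a placeholder for your simulation
    df = pd.DataFrame(out, columns=cols)
    df.to_parquet('results/filename{:02d}.parq'.format(chunk_id))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        pool.map(run_chunk, range(10))  # e.g. 10 chunks of `mem` simulations each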
Upvotes: 0