Alex Rakowski

Reputation: 39

Pandas Dataframe too large for memory, problems implementing dask

I am writing a script that adds my simulated data to a pandas DataFrame for n simulations in a loop. When I choose a value of n greater than roughly 15 it crashes; I think my DataFrame becomes too big to hold in memory while the simulations are running.

I create an empty DataFrame:

import pandas as pd

df = pd.DataFrame(
    {'gamma': [],
     'alpha': [],
     'focus': [],
     'X': [],
     'Y': [],
     'Z': [],
     'XX': [],
     'img1': [],
     'img2': [],
     'num_samples': [],
     'resolution': []})

for i in range(n):
    # some simulation stuff

and then populate my DataFrame with the values:

df.loc[i] = {
    'gamma': gamma,
    'alpha': alpha,
    'focus': focus,
    'X': X,
    'Y': Y,
    'Z': Z,
    'XX': XX,
    'img1': img1,
    'img2': img2,
    'num_samples': num_samples,
    'resolution': resolution}

I run through this n times to populate my DataFrame and then save it. However, it keeps crashing. I thought dask.dataframe might be a good fit here:

import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame(
    {'gamma': [],
     'alpha': [],
     'focus': [],
     'X': [],
     'Y': [],
     'Z': [],
     'XX': [],
     'img1': [],
     'img2': [],
     'num_samples': [],
     'resolution': []
    }), chunksize=10)

and then populate my data

df.loc[i] = {'probe': wave.array.real,
    'gamma': gamma,
    'alpha': alpha,
    'focus': focus,
    'X': X,
    'Y': Y,
    'Z': Z,
    'XX': XX,
    'img1': img1,
    'img2': img2,
    'num_samples': num_samples,
    'resolution': resolution}

But I get the error: '_LocIndexer' object does not support item assignment.

I have considered creating the pandas DataFrame within the loop and saving it for each simulated value, along the lines of the sketch below. But this seems inefficient, and I think I should be able to do it within dask.
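Something like this, where run_simulation is just a stand-in for my actual simulation code:

for i in range(n):
    row = run_simulation(i)  # stand-in for my simulation; returns one row as a dict
    pd.DataFrame([row]).to_pickle('sim_{}.pkl'.format(i))  # one small file per iteration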

Any suggestions?

I'm running Windows with 20 GB RAM, an SSD, and an i7, if that helps.

Upvotes: 0

Views: 1148

Answers (2)

mdurant

Reputation: 28684

As the error message suggests, Dask does not generally allow you to alter the contents of a dataframe in place. Furthermore, it is really unusual to try to append to or otherwise change the size of a dask dataframe once created. Since you are running out of memory, though, Dask is still your tool of choice, so here is something that might be the easiest way to do it while keeping close to your original code.

import pandas as pd
import dask.dataframe as dd

meta = pd.DataFrame(
    {'gamma': [],
     'alpha': [],
     'focus': [],
     'X' : [],
     'Y' : [],
     'Z' : [],
     'XX' : [],
     'img1' : [],
     'img2' : [],
     'num_samples' : [],
     'resolution' : []})

def chunk_to_data(df):
    # build one small pandas dataframe for this chunk of indices
    out = meta.copy()
    for i in df.i:
        out.loc[i] = {...}  # fill in your simulated values for index i
    return out

# the pandas dataframe of indices is small enough to fit in memory
d = dd.from_pandas(pd.DataFrame({'i': range(n)}), chunksize=10)

ddf = d.map_partitions(chunk_to_data, meta=meta)

This builds a lazy prescription, so when you come to process across your indices, you run one chunk at a time per thread (be careful not to use too many threads at once).
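For example, executing the graph and writing the result out chunk by chunk might look like this (the synchronous scheduler is just an illustration, and to_parquet assumes a parquet engine such as fastparquet is installed):

import dask

# run the lazy graph one partition at a time and write each chunk to disk
with dask.config.set(scheduler='synchronous'):
    ddf.to_parquet('results.parq')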

In general, it may have been better to use dask.delayed with a function taking a start index and an end index, making the dataframe for each piece without any input dataframe, and then dd.from_delayed to construct the dask dataframe.
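A minimal sketch of that alternative, where simulate_one is a hypothetical function returning one row of results as a dict:

import dask
import dask.dataframe as dd

@dask.delayed
def make_chunk(start, end):
    # build one small pandas dataframe per block of simulations
    out = meta.copy()
    for i in range(start, end):
        out.loc[i] = simulate_one(i)  # hypothetical: one row of results as a dict
    return out

chunks = [make_chunk(s, min(s + 10, n)) for s in range(0, n, 10)]
ddf = dd.from_delayed(chunks, meta=meta)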

Upvotes: 2

rpanai

Reputation: 13447

Please take this as a comment with some code rather than an answer. I'm not sure that what you're doing is the best way to work with pandas. Since you want to store results, it may be better to save them in chunks and then read them back with dask when you want to run some analysis. I would split the simulations so that each batch of results fits in memory and save each batch to disk. Say only 100 simulations fit in memory; then for every chunk of 100 you can do:

import pandas as pd

cols = ['gamma', 'alpha', 'focus', 'X',
        'Y', 'Z', 'XX', 'img1', 'img2',
        'num_samples', 'resolution']

mem = 100  # number of simulations whose results fit in memory
out = []
for i in range(mem):
    # your simulation returns a list of results, one value per column
    out.append(results)

df = pd.DataFrame(out, columns=cols)
df.to_csv('results/filename01.csv')
# or, even better:
df.to_parquet('results/filename01.parq')

Finally, you can run this block in parallel with dask or multiprocessing (it mostly depends on the simulation: is it single-threaded or not?); see the sketch below.
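For instance, a rough sketch with multiprocessing, where simulate is a placeholder for your simulation and n_chunks is the number of 100-simulation blocks:

from multiprocessing import Pool

import pandas as pd

def simulate_chunk(k):
    # run one block of `mem` simulations and write it to its own file
    out = [simulate(i) for i in range(k * mem, (k + 1) * mem)]  # simulate is a placeholder
    df = pd.DataFrame(out, columns=cols)
    df.to_parquet('results/filename{:02d}.parq'.format(k))

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        pool.map(simulate_chunk, range(n_chunks))  # n_chunks = total simulations // mem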

Upvotes: 0
