David saffo

Reputation: 23

Memory Blow Up when using Dask compute or persist with Dask Delayed

I am trying to process the data of several subjects, all held in one dataframe. There are >30 subjects and 14 computations per subject. It is a large data set, but anything more than 5 subjects blows up the memory on the scheduler node, even though no workers run on the same node as the scheduler and it has 128 GB of memory. Any ideas how I can get around this, or am I doing something wrong? Code below.

import pandas as pd
import numpy as np
from dask import delayed

# df (the full dataframe), chn (the list of channel column names), and
# e (the dask.distributed Client) are defined earlier in the script

def channel_select(chn, sub):
    # pull out one subject and build 13 lagged copies of the chosen channel
    subject = pd.DataFrame(df.loc[df['sub'] == sub])
    subject['s0'] = subject[chn]
    val = []
    for x in range(13):
        for i in range(len(subject)):
            val.append(subject['s0'].values[i - x])
        name = 's' + str(x + 1)
        subject[name] = val
        val = []
    return subject

subs = df['sub'].unique()
subs = np.delete(subs, [34, 33])

chn_del = []
for s in subs:
    for c in chn:
        chn_del.append(delayed(channel_select)(c, s))

results = e.persist(chn_del)

The code shown is set up to run all the subjects, but with any more than 5 at a time I run out of memory.
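For reference, 5 subjects at a time does fit. A rough sketch of what that batching looks like (only illustrative; e and chn_del are the client and the delayed task list from the code above):

tasks_per_subject = len(chn)          # 14 delayed calls per subject
batch_size = 5 * tasks_per_subject    # 5 subjects' worth of tasks

for start in range(0, len(chn_del), batch_size):
    batch = chn_del[start:start + batch_size]
    futures = e.compute(batch)         # submit just this batch
    batch_results = e.gather(futures)  # block until the batch finishes
    # ... save or reduce batch_results here so the memory is released
    #     before the next batch is submitted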

Upvotes: 0

Views: 637

Answers (2)

David saffo

Reputation: 23

As Mary stated above, every call to channel_select creates and stores a dataframe in the scheduler's memory. With 30 subjects, 14 calls each, and a ~2 GB dataframe... yeah, you can do the math on how much memory that was trying to grab.
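Spelled out with the numbers above (the 2 GB figure is approximate):

n_subjects = 30
calls_per_subject = 14
frame_gb = 2   # rough size of each dataframe returned by channel_select

print(n_subjects * calls_per_subject * frame_gb)  # ~840 GB that persist tries to hold at once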

Upvotes: 0

xmadscientist

Reputation: 26

You're telling the computer to hold almost 1,000 GB in memory.

But you knew that already (:

Upvotes: 1
