Ulderique Demoitre

Reputation: 1068

Pandas dataframe pivot not fitting in memory

I have a dataframe df with the following structure:

        val          newidx    Code
Idx                             
0       1.0      1220121127    706
1       1.0      1220121030    706
2       1.0      1620120122    565

It has 1,000,000 rows. In total there are 600 unique Code values and 200,000 unique newidx values.

If I perform the following operation

df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')

I get a MemoryError. This seems strange, as the size of the resulting dataframe should be manageable: 200,000 x 600.

How much memory does such an operation require? Is there a way to fix this memory error?
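(For scale, a back-of-envelope estimate, assuming 8-byte float64 cells: the dense result alone is close to 1 GB, before counting any intermediate copies pivot_table makes.)

```python
# Rough footprint of the dense 200,000 x 600 result, assuming float64 cells
rows, cols = 200_000, 600
cell_bytes = 8  # float64
result_mb = rows * cols * cell_bytes / 1024**2
print(f"{result_mb:.0f} MB")  # → about 916 MB for the result alone
```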

Upvotes: 10

Views: 13908

Answers (2)

mproffitt

Reputation: 2527

I've had a very similar problem when carrying out a merge between 4 dataframes recently.

What worked for me was disabling the index during the groupby, then merging.

If @Kartik's answer doesn't work, try this before chunking the DataFrame.

df.groupby(['newidx', 'Code'], as_index=False)['val'].max().pivot(index='newidx', columns='Code', values='val')

Upvotes: 3

Kartik

Reputation: 8693

Try to see if this fits in your memory:

df.groupby(['newidx', 'Code'])['val'].max().unstack()

pivot_table is unfortunately very memory intensive as it may make multiple copies of data.
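A minimal sketch of the groupby approach on toy data shaped like the question's (the values here are invented for illustration):

```python
import pandas as pd

# Tiny reproduction of the question's structure, with made-up values
df = pd.DataFrame({
    'newidx': [1220121127, 1220121030, 1620120122, 1220121127],
    'Code':   [706, 706, 565, 706],
    'val':    [1.0, 2.0, 1.0, 3.0],
})

# Equivalent to pivot_table(values='val', index='newidx',
# columns='Code', aggfunc='max'), but built via groupby
wide = df.groupby(['newidx', 'Code'])['val'].max().unstack()
print(wide)
```

Missing (newidx, Code) combinations come out as NaN, just as with pivot_table.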


If the groupby does not work, you will have to split your DataFrame into smaller pieces. Try not to assign intermediate results multiple times. For example, if reading from a CSV:

df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()

avoids multiple assignments.
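If even the single chained expression runs out of memory, the chunking idea can be sketched as follows. Because max is associative, you can take the per-chunk max and then reduce the partial results with another max. The file name, data, and chunk size below are placeholders for illustration:

```python
import pandas as pd

# Stand-in data; in practice 'file.csv' would be your real file
pd.DataFrame({
    'newidx': [1, 1, 2, 2, 1],
    'Code':   [10, 20, 10, 20, 10],
    'val':    [1.0, 2.0, 3.0, 4.0, 5.0],
}).to_csv('file.csv', index=False)

# Aggregate each chunk separately, then reduce the partial maxima
pieces = [
    chunk.groupby(['newidx', 'Code'])['val'].max()
    for chunk in pd.read_csv('file.csv', chunksize=2)
]
result = pd.concat(pieces).groupby(level=['newidx', 'Code']).max().unstack()
print(result)
```

Only one chunk plus the running partial results need to fit in memory at a time, rather than the whole long-format DataFrame.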

Upvotes: 12
