Reputation: 1068
I have a DataFrame df with the following structure:
     val      newidx  Code
Idx
0    1.0  1220121127   706
1    1.0  1220121030   706
2    1.0  1620120122   565
It has 1,000,000 rows.
In total there are 600 unique Code values and 200,000 unique newidx values.
If I perform the following operation
df.pivot_table(values='val', index='newidx', columns='Code', aggfunc='max')
I get a MemoryError, but this seems strange, as the size of the resulting DataFrame should be manageable: 200000 x 600.
How much memory does such an operation require? Is there a way to fix this memory error?
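For scale, here is a rough back-of-the-envelope figure for the dense output alone (assuming float64 values):
# 200,000 unique newidx rows x 600 unique Code columns, 8 bytes per float64 cell,
# before counting any intermediate copies that pivot_table may make.
n_rows, n_cols = 200_000, 600
print(n_rows * n_cols * 8 / 1024**3)  # ~0.89 GiB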
Upvotes: 10
Views: 13908
Reputation: 2527
I've had a very similar problem recently when carrying out a merge between 4 DataFrames.
What worked for me was disabling the index during the groupby, then merging.
If @Kartik's answer doesn't work, try this before chunking the DataFrame.
df.groupby(['newidx', 'Code'], as_index=False)['val'].max().pivot(index='newidx', columns='Code', values='val')
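The merge pattern itself looks roughly like this (the frames df1/df2 and columns key, a, b are hypothetical, just to illustrate keeping the flat index through the groupby):
# Hypothetical frames sharing a 'key' column; as_index=False keeps a flat
# index, so the reduced frames can be merged directly without reset_index.
g1 = df1.groupby('key', as_index=False)['a'].max()
g2 = df2.groupby('key', as_index=False)['b'].max()
merged = g1.merge(g2, on='key')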
Upvotes: 3
Reputation: 8693
Try to see if this fits in memory:
df.groupby(['newidx', 'Code'])['val'].max().unstack()
pivot_table is unfortunately very memory intensive, as it may make multiple copies of the data.
If the groupby does not work, you will have to split your DataFrame into smaller pieces (a chunked sketch is shown below). Try not to assign multiple times. For example, if reading from a CSV:
df = pd.read_csv('file.csv').groupby(['newidx', 'Code'])['val'].max().unstack()
avoids multiple assignments.
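If even that fails, a minimal sketch of the chunked approach (assuming the data lives in file.csv with the columns above) could look like:
import pandas as pd

# Reduce each chunk with groupby/max, then combine the partial results
# with a second groupby/max before unstacking into the wide table.
chunks = pd.read_csv('file.csv', chunksize=100_000)
partials = [c.groupby(['newidx', 'Code'])['val'].max() for c in chunks]
result = pd.concat(partials).groupby(level=['newidx', 'Code']).max().unstack()
Each partial result is much smaller than the raw chunk, so the final concat and unstack only ever deal with at most 200,000 x 600 values.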
Upvotes: 12