Shri

Reputation: 164

memory issue when merging two data frames

I am stuck at the second-to-last statement below and clueless about how to proceed. The error is: numpy.core._exceptions.MemoryError: Unable to allocate 58.1 GiB for an array with shape (7791676634,) and data type int64

My thinking was that merging a data frame of ~12 million records with another data frame that only adds 3-4 columns should not be a big deal. Please help me out; I am totally stuck here. Thanks

Select_Emp_df has around 900k records and Big_df has around 12 million records and 9 columns. I just need to merge the two DFs on a key column, like doing a VLOOKUP in Excel.

import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory = False )

# Append data into one data frame from three csv files of 3 years' transactions
df2019 = pd.read_csv('U21_02767G - Customer Trade Info2019.csv', low_memory=False)
df2021 = pd.read_csv('U21_02767G - Customer Trade Info2021(TillSep).csv', low_memory=False)
df2020 = pd.read_csv('Newdf2020.csv', low_memory=False)

Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)

Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']]

Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
Big_df.info()

Upvotes: 1

Views: 1388

Answers (4)

Ehsan Fathi

Reputation: 708

Another approach would be to do the merge manually. In my case one dataset is huge [9000000, 580] and the other one is small [10000, 3].

The problem with a normal merge is that pandas first builds a third dataframe holding the result of the merge and only then assigns it to the variable. That means there is a point in time at which your memory must hold df1, df2 and the result of the join all at once.

So if you have two data frames df1 and df2, and the merge result is not going to be much smaller than the inputs, then when you run:

df1 = pd.merge(df1, df2)

the result of the merge sits in memory alongside df1 and df2 before it is assigned back to df1. To avoid creating that extra dataframe, we can do the join manually. This is not recommended and is not the most computationally efficient way of doing a join, but it is definitely more memory efficient:

# cols_to_add: the columns you want to bring over from df2 into df1
for col in cols_to_add:
    mapping = df2.set_index('joining_column_in_df2')[col].to_dict()
    df1[col] = df1['joining_column_in_df1'].map(mapping)

At the end, df1 holds the result of the merge. When df2 is much smaller than df1, this approach can actually beat a normal merge, because it skips the overhead of building a second huge dataframe and deleting the original one.
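
For example, here is a minimal sketch of the idea on made-up toy frames (the column names below are purely illustrative, not from the question):

import pandas as pd

# toy stand-ins: df1 is the big frame, df2 the small lookup frame
df1 = pd.DataFrame({'custkey': [1, 2, 3, 2], 'amount': [10, 20, 30, 40]})
df2 = pd.DataFrame({'custkey': [1, 2, 3], 'gender': ['F', 'M', 'F']})

# bring each wanted column from df2 into df1 through a plain dict lookup
for col in ['gender']:
    mapping = df2.set_index('custkey')[col].to_dict()
    df1[col] = df1['custkey'].map(mapping)

print(df1)  # df1 now carries the 'gender' column, with no intermediate merge result

Note that this only works cleanly when the joining column is unique in df2; duplicated keys would silently keep the last value in the dict.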

Upvotes: 0

Shri

Reputation: 164

Using the suggestions provided by the community here and a bit of my own research, I edited my code, and here is what worked for me:

Select_Emp_df['LSTTRDDT'] = pd.to_datetime(Select_Emp_df['LSTTRDDT'], errors='coerce')
Select_Emp_df = Select_Emp_df.sort_values(by='LSTTRDDT', ascending=True)
Select_Emp_df = Select_Emp_df.drop_duplicates(subset='CUSTKEY', keep='last')

I just sorted the values on the last transaction date and dropped the duplicates (on CUSTKEY) in the Select_Emp_df data frame, so each customer appears only once before the merge.
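
For anyone hitting the same error: an allocation of ~7.8 billion elements is typically what you get when the join key is duplicated on both sides of a merge, so a quick sanity check before merging is to count duplicated keys and, if you expect a VLOOKUP-style join, let pandas validate the cardinality:

# how many CUSTKEY values repeat on each side of the join?
print(Select_Emp_df['CUSTKEY'].duplicated().sum())
print(Big_df['CUSTKEY'].duplicated().sum())

# raises MergeError if Select_Emp_df is not unique on CUSTKEY
Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY', validate='many_to_one')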

Upvotes: 0

Serge Ballesta

Reputation: 148870

When I was younger, I used machines where the available RAM per process was 32k to 640k, and I used to process huge datasets on them (err... several MB, but much larger than memory). The key was to keep in memory only what was required.

Here you concat 3 large dataframes and later merge the result with another one. If you have memory issues, just reverse the concat and the merge: merge each individual file with Emp_df, immediately write the merged file to disk, and throw everything out of your memory between each step. If you use csv files, you can even build the concatenated csv file directly by appending the 2nd and 3rd merged files to the first one (use mode='a', header=False in the to_csv method).
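
A rough sketch of that idea, reusing the file names from the question (the output name Big_merged.csv is just an illustrative choice, and it assumes the three yearly files share the same columns in the same order so the appended rows line up):

import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)
Select_Emp_df = Emp_df[['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']]

files = [
    'U21_02767G - Customer Trade Info2019.csv',
    'Newdf2020.csv',
    'U21_02767G - Customer Trade Info2021(TillSep).csv',
]

for i, name in enumerate(files):
    yearly = pd.read_csv(name, low_memory=False)
    merged = pd.merge(yearly, Select_Emp_df, on='CUSTKEY')
    # first file writes the header, the later ones append without it
    merged.to_csv('Big_merged.csv', mode='w' if i == 0 else 'a',
                  header=(i == 0), index=False)
    del yearly, merged  # keep only what is needed in memory between steps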

Upvotes: 1

alparslan mimaroğlu

Reputation: 1480

Just before Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY'), try deleting the previous dataframes, like this:

del df2019
del df2020
del df2021

This should save some memory.

Also try:

Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']].drop_duplicates(subset=['CUSTKEY'])

Upvotes: 1
