Reputation: 164
I am stuck at the second-to-last statement below, clueless. The error is: numpy.core._exceptions.MemoryError: Unable to allocate 58.1 GiB for an array with shape (7791676634,) and data type int64
My thinking was that merging a data frame of ~12 million records with another data frame that adds 3-4 columns should not be a big deal. Please help me out. Totally stuck here. Thanks.
Select_Emp_df has around 900k records and Big_df has around 12 million records and 9 columns. I just need to merge the two DFs on a key column, like a VLOOKUP in Excel.
import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)

# Append data into one data frame from three csv files of 3 years' transactions
df2019 = pd.read_csv('U21_02767G - Customer Trade Info2019.csv', low_memory=False)
df2021 = pd.read_csv('U21_02767G - Customer Trade Info2021(TillSep).csv', low_memory=False)
df2020 = pd.read_csv('Newdf2020.csv', low_memory=False)
Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)

Select_Emp_df = Emp_df[['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']]
Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')
Big_df.info()
Upvotes: 1
Views: 1388
Reputation: 708
Another approach would be to do the merge manually. In my case one dataset was huge [9000000, 580] and the other one was small [10000, 3].
The problem with merging normally is that when you merge two data frames, pandas first creates a third dataframe (the result of the merge) and only then assigns it to the variable. That means there is a point in time when your memory needs to hold df1, df2 and the result of the join all at once.
So if you have two data frames df1 and df2 and the merge result is not going to shrink much in size, then when you run:
df1 = pd.merge(df1, df2)
the result of the merge sits in memory before it is assigned back to df1. To prevent the creation of that extra dataframe, we can do the join manually. This is not generally recommended and is not the most computationally efficient way of doing a join, but it is definitely more memory efficient:
for col in columns_to_add:  # the columns that you want to add to df1 from df2
    col_dict = df2.set_index('joining_column_in_df2')[col].to_dict()
    df1[col] = df1['joining_column_in_df1'].apply(lambda x: col_dict.get(x))
At the end, df1 is the result of the merge. In case df2 is way smaller than df1, this approach actually beats a normal merge because it skips the overhead of creating another huge dataframe and deleting the original one.
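As a rough illustration (a minimal sketch with made-up toy data and column names, not the asker's actual files), the loop behaves like a per-column lookup:

import pandas as pd

# df1 is the big frame, df2 holds the extra columns to look up
df1 = pd.DataFrame({'joining_column_in_df1': [1, 2, 2, 3], 'amount': [10, 20, 30, 40]})
df2 = pd.DataFrame({'joining_column_in_df2': [1, 2, 3],
                    'gender': ['F', 'M', 'F'],
                    'segment': ['A', 'B', 'A']})

for col in ['gender', 'segment']:  # columns to add to df1 from df2
    col_dict = df2.set_index('joining_column_in_df2')[col].to_dict()
    df1[col] = df1['joining_column_in_df1'].map(col_dict)

print(df1)

Series.map(col_dict) does the same lookup as the apply/lambda version and is usually a bit faster; either way, only one extra column is materialised at a time instead of a whole merged copy of df1.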
Upvotes: 0
Reputation: 164
Using suggestions provided by the community here and a bit of my own research, I edited my code, and here is what worked for me:
Select_Emp_df['LSTTRDDT'] = pd.to_datetime(Select_Emp_df['LSTTRDDT'], errors='coerce')
Select_Emp_df = Select_Emp_df.sort_values(by='LSTTRDDT', ascending=True)
Select_Emp_df = Select_Emp_df.drop_duplicates(subset='CUSTKEY', keep='last')
I just sorted the values on the last transaction date (LSTTRDDT) and dropped duplicate CUSTKEY rows in the Select_Emp_df data frame, keeping the latest one. With the duplicate keys gone, the merge is many-to-one instead of many-to-many, which is what had blown the result up to billions of rows.
Upvotes: 0
Reputation: 148870
When I was younger, I used machines where the available RAM per process was 32k to 640k. And I used to process huge datasets on that (err... several MB, but much larger than memory). The key was to keep in memory only what was required.
Here you concat 3 large dataframes and later merge the result with another one. If you have memory issues, just reverse the concat and the merge: merge each individual file with Emp_df and immediately write the merged file to disk, throwing everything out of memory between each step. If you use csv files, you can even build the concatenated csv file directly by appending the 2nd and 3rd merged files to the first one (use mode='a', header=False in the to_csv method).
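A minimal sketch of that idea, assuming the file and column names from the question (the Big_df.csv output name is just a placeholder):

import pandas as pd

Emp_df = pd.read_csv('New_Employee_df.csv', low_memory=False)
Select_Emp_df = Emp_df[['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']]

files = ['U21_02767G - Customer Trade Info2019.csv',
         'Newdf2020.csv',
         'U21_02767G - Customer Trade Info2021(TillSep).csv']

for i, path in enumerate(files):
    yearly = pd.read_csv(path, low_memory=False)
    merged = pd.merge(yearly, Select_Emp_df, on='CUSTKEY')
    # The first file creates the output csv; later files are appended without the header
    merged.to_csv('Big_df.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)
    del yearly, merged  # keep only one yearly file in memory at a time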
Upvotes: 1
Reputation: 1480
Just before Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY'), try to delete the previous dataframes, like this:
del df2019
del df2020
del df2021
This should save some memory.
Also try:
Select_Emp_df = Emp_df[['CUSTKEY','GCIF_GENDER_DSC','SEX']].drop_duplicates(subset=['CUSTKEY'])
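Putting the two suggestions together, a sketch (assuming Emp_df and the three yearly frames are already loaded as in the question; gc.collect() is an optional extra to nudge the interpreter into releasing the freed memory):

import gc

Big_df = pd.concat([df2019, df2020, df2021], ignore_index=True)

# Drop the per-year frames before the merge so their memory can be reclaimed
del df2019, df2020, df2021
gc.collect()

Select_Emp_df = Emp_df[['CUSTKEY', 'GCIF_GENDER_DSC', 'SEX']].drop_duplicates(subset=['CUSTKEY'])
Big_df = pd.merge(Big_df, Select_Emp_df, on='CUSTKEY')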
Upvotes: 1