Reputation: 3
I have the Python code below, which compares a source system file with a target system file. Both files are comma separated and can arrive with a .txt or .csv extension. The code lists the records that are unique to the source file (i.e. not in the target file) and the records that are unique to the target file (i.e. not in the source file). This works well when the files are under 1 GB, but for some reconciliations the entire table is sent, which contains millions of records and pushes the file size to around 4-5 GB. When I run the code in PyCharm I then get a memory utilization error and PyCharm closes. I tried increasing the heap size to 8096 MB, but I still get the same issue. Any insight on how to overcome the memory problem?
File Layout (for both source system and target system):
12340012,01,78978912,12345678912345
1213456789,02,56789012,1234567890121
Here field 1 can be up to 10 chars, field 2 up to 2 chars, field 3 up to 8 chars, field 4 up to 14 chars
Python Code:
import datetime
import pandas as pd

# TargetFile, SourceFile and diffFile are assumed to be defined earlier (input/output paths)

# load data into DataFrame objects
print("Load started for DataFrame - Target File", datetime.datetime.now())
df_target = pd.read_csv(TargetFile, names=["CardholdNumber", "MemberNo", "CrdSponsorNo"],
                        dtype={"CardholdNumber": str, "MemberNo": str, "CrdSponsorNo": str})
print("Load ended for DataFrame - Target File", datetime.datetime.now())
print('//')
print("Load started for Header in DataFrame - Target File", datetime.datetime.now())
df_target.insert(0, "Source", "rods10")
print("Load ended for Header in DataFrame - Target File", datetime.datetime.now())
print('//')
print("Load started for DataFrame - Source File", datetime.datetime.now())
df_source = pd.read_csv(SourceFile, names=["CardholdNumber", "MemberNo", "CrdSponsorNo"],
                        dtype={"CardholdNumber": str, "MemberNo": str, "CrdSponsorNo": str})
print("Load ended for DataFrame - Source File", datetime.datetime.now())
print('//')
print("Load started for Header in DataFrame - Source File", datetime.datetime.now())
df_source.insert(0, "Source", "prod10")
print("Load ended for Header in DataFrame - Source File", datetime.datetime.now())
print('//')
# vertical concatenation of both DataFrames
print("DataFrame Concatenation started", datetime.datetime.now())
df = pd.concat([df_target, df_source], axis=0)
print("DataFrame Concatenation ended", datetime.datetime.now())
print('//')
print("DataFrame sorting started", datetime.datetime.now())
df1 = df.sort_values(by=["CardholdNumber", "MemberNo", "CrdSponsorNo"])
print("DataFrame sorting ended", datetime.datetime.now())
print('//')
# keep=False drops every row present in both files, leaving only the differences
print("DataFrame drop_duplicate started", datetime.datetime.now())
newdf = df1.drop_duplicates(subset=["CardholdNumber", "MemberNo", "CrdSponsorNo"], keep=False)
print("DataFrame drop_duplicate ended", datetime.datetime.now())
print('//')
print("DataFrame write to output file started", datetime.datetime.now())
newdf.to_csv(diffFile)
print("DataFrame write to output file ended", datetime.datetime.now())
print('//')
Upvotes: 0
Views: 48
Reputation: 13097
This might help get you started. We are going to use the hashlib.sha256() digest
of each row as a key and then use it to determine which rows are unique to each dataset.
The memory high-water mark should then only be a dictionary of these hashes for dataset 1, rather than the full contents of both files.
import random
import string
import hashlib
import csv

## ---------------------------
## Create a bunch of random test rows
## ---------------------------
data = [
    [
        "".join(random.choice(string.digits) for _ in range(10)),
        "".join(random.choice(string.digits) for _ in range(2)),
        "".join(random.choice(string.digits) for _ in range(8)),
        "".join(random.choice(string.digits) for _ in range(14)),
    ]
    for _ in range(100)
]
## ---------------------------

## ---------------------------
## create test data files
## ensuring that each file has at least one distinct row.
## ---------------------------
with open("data_a.txt", "w", newline="") as file_out:
    csv.writer(file_out).writerows(data[5:])

with open("data_b.csv", "w", newline="") as file_out:
    csv.writer(file_out).writerows(data[:-5])
## ---------------------------

## ---------------------------
## Read the first file and use the hash of the row
## as a dictionary key with the value of the index of that row
## ---------------------------
distinct_a = {}
with open("data_a.txt", "r") as file_in:
    for index, row in enumerate(file_in):
        key = hashlib.sha256(row.encode()).digest()
        distinct_a[key] = index

## ---------------------------
## Read the second file and hash the row.
## if that hash is in the dictionary remove that key
## otherwise it is distinct to the second file
## ---------------------------
distinct_b = []
with open("data_b.csv", "r") as file_in:
    for row in file_in:
        key = hashlib.sha256(row.encode()).digest()
        if key in distinct_a:
            del distinct_a[key]
        else:
            distinct_b.append(row)
## ---------------------------
## ---------------------------
## re-read the first file to get the distinct
## rows by index. At this point the dictionary
## should only have entries distinct to it
## ---------------------------
remaining_indexes = set(distinct_a.values())  # set membership keeps this pass fast for millions of rows
with open("data_a.txt", "r") as file_in:
    distinct_a = [
        line
        for index, line
        in enumerate(file_in)
        if index in remaining_indexes
    ]
## ---------------------------
print("Distinct rows of data_a.txt")
for row in distinct_a:
print(f"\t{row.strip()}")
print("Distinct rows of data_b.csv")
for row in distinct_b:
print(f"\t{row.strip()}")
That should give you something like:
Distinct rows of data_a.txt
    1334346746,66,62164494,71677917539944
    3565821459,34,49828768,96013202928584
    1092276254,71,02236156,81649689378691
    5227992584,84,65952571,24866977701386
    5246876395,96,83509545,66553005639782
Distinct rows of data_b.csv
    3116989960,09,95653733,23178941469168
    2693792822,43,68213056,47212574630967
    7087320938,80,47462357,99362176025434
    9527865125,68,52772900,55461281823636
    0039797355,42,40483637,14285794991838
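To run this against your actual files, you would drop the test-data generation, point the open() calls at your source and target paths, and write the collected rows out instead of printing them. Here is a minimal sketch of that, assuming hypothetical file names (source_extract.txt, target_extract.csv, uniq_to_source.csv, uniq_to_target.csv) that you would replace with your own:
import hashlib

## hypothetical input paths - substitute your real file names
SOURCE_FILE = "source_extract.txt"
TARGET_FILE = "target_extract.csv"

## hash every source row, remembering its line index
distinct_source = {}
with open(SOURCE_FILE, "r") as file_in:
    for index, row in enumerate(file_in):
        distinct_source[hashlib.sha256(row.encode()).digest()] = index

## target rows whose hash is not in the source dictionary are unique
## to the target; matching hashes are discarded from the dictionary
unique_to_target = []
with open(TARGET_FILE, "r") as file_in:
    for row in file_in:
        key = hashlib.sha256(row.encode()).digest()
        if key in distinct_source:
            del distinct_source[key]
        else:
            unique_to_target.append(row)

with open("uniq_to_target.csv", "w") as file_out:
    file_out.writelines(unique_to_target)

## re-read the source file and stream out only the rows whose
## indexes are still in the dictionary (i.e. unique to the source)
remaining_indexes = set(distinct_source.values())
with open(SOURCE_FILE, "r") as file_in, open("uniq_to_source.csv", "w") as file_out:
    for index, row in enumerate(file_in):
        if index in remaining_indexes:
            file_out.write(row)
Because the second pass over the source file streams rows straight to disk, only the hash dictionary and the list of target-only rows ever sit in memory, not the full contents of either file.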
Upvotes: 0