Reputation: 3
I have the Python code below, which compares a source system file with a target system file. Both files are comma separated and can arrive with a .txt or .csv extension. The code lists the records that are unique to the source file (i.e. not in the target file) and the records that are unique to the target file (i.e. not in the source file). This works well when the files are under 1 GB, but for some reconciliations the entire table is sent, which contains millions of records and pushes the file size to around 4-5 GB. When I run the code in PyCharm I then get a memory utilization error and PyCharm closes. I tried increasing the heap size to 8096 MB, but I still get the same issue. Any insight on how to overcome the memory problem?
File Layout (for both source system and target system):
12340012,01,78978912,12345678912345
1213456789,02,56789012,1234567890121
Here field 1 can be up to 10 chars, field 2 up to 2 chars, field 3 up to 8 chars, field 4 up to 14 chars
Python Code:
import datetime
import pandas as pd

# TargetFile, SourceFile and diffFile are assumed to be defined earlier (input/output paths)

# load data into DataFrame objects
print("Load started for DataFrame - Target File", datetime.datetime.now())
df_target = pd.read_csv(TargetFile, names=["CardholdNumber", "MemberNo", "CrdSponsorNo"],
                        dtype={"CardholdNumber": str, "MemberNo": str, "CrdSponsorNo": str})
print("Load ended for DataFrame - Target File", datetime.datetime.now())
print('//')
print("Load started for Header in DataFrame - Target File", datetime.datetime.now())
df_target.insert(0, "Source", "rods10")
print("Load ended for Header in DataFrame - Target File", datetime.datetime.now())
print('//')
print("Load started for DataFrame - Source File", datetime.datetime.now())
df_source = pd.read_csv(SourceFile, names=["CardholdNumber", "MemberNo", "CrdSponsorNo"],
                        dtype={"CardholdNumber": str, "MemberNo": str, "CrdSponsorNo": str})
print("Load ended for DataFrame - Source File", datetime.datetime.now())
print('//')
print("Load started for Header in DataFrame - Source File", datetime.datetime.now())
df_source.insert(0, "Source", "prod10")
print("Load ended for Header in DataFrame - Source File", datetime.datetime.now())
print('//')
# vertical concatenation of both DataFrames
print("DataFrame Concatenation started", datetime.datetime.now())
df = pd.concat([df_target, df_source], axis=0)
print("DataFrame Concatenation ended", datetime.datetime.now())
print('//')
print("DataFrame sorting started", datetime.datetime.now())
df1 = df.sort_values(by=["CardholdNumber", "MemberNo", "CrdSponsorNo"])
print("DataFrame sorting ended", datetime.datetime.now())
print('//')
# keep=False drops every row present in both files, leaving only the differences
print("DataFrame drop_duplicate started", datetime.datetime.now())
newdf = df1.drop_duplicates(subset=["CardholdNumber", "MemberNo", "CrdSponsorNo"], keep=False)
print("DataFrame drop_duplicate ended", datetime.datetime.now())
print('//')
print("DataFrame write to output file started", datetime.datetime.now())
newdf.to_csv(diffFile)
print("DataFrame write to output file ended", datetime.datetime.now())
print('//')
Upvotes: 0
Views: 48
Reputation: 13097
This might help get you started. We are going to use the hashlib.sha256() digest
of each row as a key and then use it to determine which rows are unique to each dataset.
The memory high-water mark should then only be a dictionary of these hashes for dataset 1, rather than the full contents of both files.
import random
import string
import hashlib
import csv

## ---------------------------
## Create a bunch of random test rows
## ---------------------------
data = [
    [
        "".join(random.choice(string.digits) for _ in range(10)),
        "".join(random.choice(string.digits) for _ in range(2)),
        "".join(random.choice(string.digits) for _ in range(8)),
        "".join(random.choice(string.digits) for _ in range(14)),
    ]
    for _ in range(100)
]
## ---------------------------

## ---------------------------
## create test data files
## ensuring that each file has at least one distinct row.
## ---------------------------
with open("data_a.txt", "w", newline="") as file_out:
    csv.writer(file_out).writerows(data[5:])

with open("data_b.csv", "w", newline="") as file_out:
    csv.writer(file_out).writerows(data[:-5])
## ---------------------------

## ---------------------------
## Read the first file and use the hash of the row
## as a dictionary key with the value of the index of that row
## ---------------------------
distinct_a = {}
with open("data_a.txt", "r") as file_in:
    for index, row in enumerate(file_in):
        key = hashlib.sha256(row.encode()).digest()
        distinct_a[key] = index

## ---------------------------
## Read the second file and hash the row.
## if that hash is in the dictionary remove that key
## otherwise it is distinct to the second file
## ---------------------------
distinct_b = []
with open("data_b.csv", "r") as file_in:
    for row in file_in:
        key = hashlib.sha256(row.encode()).digest()
        if key in distinct_a:
            del distinct_a[key]
        else:
            distinct_b.append(row)
## ---------------------------
## ---------------------------
## re-read the first file to get the distinct
## rows by index. At this point the dictionary
## should only have entries distinct to it
## ---------------------------
remaining_indexes = set(distinct_a.values())  # set membership keeps this pass fast for millions of rows
with open("data_a.txt", "r") as file_in:
    distinct_a = [
        line
        for index, line
        in enumerate(file_in)
        if index in remaining_indexes
    ]
## ---------------------------
print("Distinct rows of data_a.txt")
for row in distinct_a:
print(f"\t{row.strip()}")
print("Distinct rows of data_b.csv")
for row in distinct_b:
print(f"\t{row.strip()}")
That should give you something like:
Distinct rows of data_a.txt
    1334346746,66,62164494,71677917539944
    3565821459,34,49828768,96013202928584
    1092276254,71,02236156,81649689378691
    5227992584,84,65952571,24866977701386
    5246876395,96,83509545,66553005639782
Distinct rows of data_b.csv
    3116989960,09,95653733,23178941469168
    2693792822,43,68213056,47212574630967
    7087320938,80,47462357,99362176025434
    9527865125,68,52772900,55461281823636
    0039797355,42,40483637,14285794991838
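To run this against your actual files, you would drop the test-data generation, point the open() calls at your source and target paths, and write the collected rows out instead of printing them. Here is a minimal sketch of that, assuming hypothetical file names (source_extract.txt, target_extract.csv, uniq_to_source.csv, uniq_to_target.csv) that you would replace with your own:
import hashlib

## hypothetical input paths - substitute your real file names
SOURCE_FILE = "source_extract.txt"
TARGET_FILE = "target_extract.csv"

## hash every source row, remembering its line index
distinct_source = {}
with open(SOURCE_FILE, "r") as file_in:
    for index, row in enumerate(file_in):
        distinct_source[hashlib.sha256(row.encode()).digest()] = index

## target rows whose hash is not in the source dictionary are unique
## to the target; matching hashes are discarded from the dictionary
unique_to_target = []
with open(TARGET_FILE, "r") as file_in:
    for row in file_in:
        key = hashlib.sha256(row.encode()).digest()
        if key in distinct_source:
            del distinct_source[key]
        else:
            unique_to_target.append(row)

with open("uniq_to_target.csv", "w") as file_out:
    file_out.writelines(unique_to_target)

## re-read the source file and stream out only the rows whose
## indexes are still in the dictionary (i.e. unique to the source)
remaining_indexes = set(distinct_source.values())
with open(SOURCE_FILE, "r") as file_in, open("uniq_to_source.csv", "w") as file_out:
    for index, row in enumerate(file_in):
        if index in remaining_indexes:
            file_out.write(row)
Because the second pass over the source file streams rows straight to disk, only the hash dictionary and the list of target-only rows ever sit in memory, not the full contents of either file.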
Upvotes: 0