Reputation: 33
I have an 11GB CSV file that contains some corrupt lines I have to delete. I have identified the corrupted line numbers from an ETL interface.
My program works on small datasets, but when I run it on the main file I get a MemoryError. Below is the code I'm using. Do you have any suggestions to make it work?
row_to_delete = 101068
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as file:
    data = file.readlines()
    print(data[row_to_delete - 1])
    data[row_to_delete - 1] = ''
with open(filename, 'wb', encoding="utf8", errors='ignore') as file:
    file.writelines(data)
Error:
Traceback (most recent call last):
  File "/.PyCharmCE2018.2/config/scratches/scratch_7.py", line 7, in <module>
    data = file.readlines()
MemoryError
Upvotes: 0
Views: 107
Reputation: 1121266
Rather than read the whole file into memory as a list, loop over the input file and write every line except the one you need to delete to a new file. Use enumerate() to keep a counter if you need to delete by index:
row_to_delete = 101068  # 1-based line number, as in the question
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as inputfile, \
        open(filename + '.fixed', 'w', encoding='utf8') as outputfile:
    # start=1 so the counter matches the 1-based line number
    for index, line in enumerate(inputfile, start=1):
        if index == row_to_delete:
            continue  # don't write the line that matches
        outputfile.write(line)
Rather than use an index, you could even detect a bad line directly in code this way.
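As a rough sketch of that idea: the question doesn't say what the corruption looks like, so the example below assumes a bad line is one that doesn't parse into the expected number of fields. The names is_corrupt, copy_good_lines and expected_fields are made up for illustration, and the check only works if no record spans more than one physical line.

```python
import csv


def is_corrupt(line, expected_fields, delimiter=","):
    """Hypothetical check: a line is bad if it doesn't parse into the expected field count."""
    try:
        fields = next(csv.reader([line], delimiter=delimiter))
    except (csv.Error, StopIteration):
        return True
    return len(fields) != expected_fields


def copy_good_lines(src, dst, expected_fields):
    """Stream src to dst line by line, skipping bad lines; return how many were skipped."""
    removed = 0
    # newline='' keeps the original line endings untouched
    with open(src, 'r', encoding='utf8', errors='ignore', newline='') as inputfile, \
            open(dst, 'w', encoding='utf8', newline='') as outputfile:
        for line in inputfile:
            if is_corrupt(line, expected_fields):
                removed += 1
                continue
            outputfile.write(line)
    return removed
```

Only one line is held in memory at a time, so this should behave the same on the 11GB file as on a small one.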
Note that this writes to a new file, with the same name but with .fixed added.
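You can move that file back to replace the old file if you want to, once you are done copying all but the bad line. os.rename() does this on POSIX systems, but on Windows it raises an error when the destination already exists; os.replace() overwrites the destination on every platform:
import os
os.replace(filename + '.fixed', filename)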
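Since the question mentions several corrupt line numbers, the same loop can drop all of them in one pass. This is only a sketch: delete_lines is a hypothetical helper name, and it assumes the numbers are 1-based, as in the question's data[row_to_delete - 1].

```python
import os


def delete_lines(filename, line_numbers):
    """Remove the given 1-based line numbers from filename in one streaming pass."""
    to_delete = set(line_numbers)  # set membership tests are O(1)
    fixed = filename + '.fixed'
    # newline='' keeps the original line endings untouched
    with open(filename, 'r', encoding='utf8', errors='ignore', newline='') as inputfile, \
            open(fixed, 'w', encoding='utf8', newline='') as outputfile:
        for number, line in enumerate(inputfile, start=1):
            if number not in to_delete:
                outputfile.write(line)
    # Swap the cleaned copy into place only after both files are closed
    os.replace(fixed, filename)
```

For example, delete_lines(filename, [101068, 250000]) would remove both lines while reading the file only once. You need enough free disk space for a second copy of the file while it runs.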
Upvotes: 2