Delete a line from a big CSV file in Python

I have an 11GB CSV file which has some corrupt lines I have to delete. I have identified the corrupted line numbers from an ETL interface.

My program runs with small datasets; however, when I run it on the main file I get a MemoryError. Below is the code I'm using. Do you have any suggestions to make it work?

row_to_delete = 101068
filename = "EKBE_0_20180907_065907 - Copy.csv"
with open(filename, 'r', encoding='utf8' ,errors='ignore') as file:
    data = file.readlines()
    print(data[row_to_delete -1 ])
    data [row_to_delete -1] = ''
with open(filename, 'wb',encoding="utf8",errors='ignore') as file:
    file.writelines( data )

Error:

Traceback (most recent call last):
  File "/.PyCharmCE2018.2/config/scratches/scratch_7.py", line 7, in <module>
    data = file.readlines()
MemoryError

Upvotes: 0

Views: 107

Answers (1)

Martijn Pieters

Reputation: 1121266

Rather than read the whole file into memory, loop over the input file and write every line except the one you need to delete to a new file. Use enumerate() to keep a counter if you need to delete by line number:

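row_to_delete = 101068  # 1-based line number, as in the question
filename = "EKBE_0_20180907_065907 - Copy.csv"
# Open the output in text mode ('w'); binary mode ('wb') does not accept an encoding.
with open(filename, 'r', encoding='utf8', errors='ignore') as inputfile,\
     open(filename + '.fixed', 'w', encoding="utf8") as outputfile:
    # Count from 1 so the counter lines up with the 1-based row number.
    for index, line in enumerate(inputfile, 1):
        if index == row_to_delete:
            continue  # don't write the line that matches
        outputfile.write(line)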

Rather than use an index, you could also detect a bad line directly in the loop by testing the content of each line.
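As a rough sketch of detecting bad lines by content: the check below treats a line as corrupt when it does not parse into the expected number of CSV fields. The function name copy_good_lines and the parameters expected_fields and delimiter are made up for illustration, and the field-count rule is only an assumption about what "corrupt" means for your data. It also assumes no record spans several physical lines (no quoted embedded newlines).

```python
import csv


def copy_good_lines(inputfile, outputfile, expected_fields, delimiter=","):
    """Copy lines from inputfile to outputfile, skipping malformed ones.

    A line counts as malformed when it cannot be parsed as CSV or does not
    have exactly expected_fields fields. Returns the number of skipped lines.
    """
    skipped = 0
    for line in inputfile:
        try:
            # Parse this single line so quoted delimiters are counted correctly.
            fields = next(csv.reader([line], delimiter=delimiter), [])
        except csv.Error:
            fields = []  # unparseable line: treat it as malformed
        if len(fields) != expected_fields:
            skipped += 1
            continue
        outputfile.write(line)
    return skipped
```

You would call it with two open text-mode file objects, the same way as in the loop above, so only one line is held in memory at a time.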

Note that this writes to a new file, with the same name but with .fixed added.

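You can move that file back to replace the old file if you want to, once you are done copying all but the bad line. Use os.replace() rather than os.rename() for this, because os.rename() raises an error on Windows when the destination file already exists:

import os

os.replace(filename + '.fixed', filename)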
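Since the question mentions several corrupt lines, the same approach can be extended to a whole set of line numbers in a single pass. This is a minimal sketch, not a definitive implementation: the helper name delete_lines is hypothetical, and it assumes the line numbers from the ETL interface are 1-based and count lines the same way Python does.

```python
import os


def delete_lines(filename, rows_to_delete, encoding="utf8"):
    """Rewrite filename in place without the given 1-based line numbers."""
    rows = set(rows_to_delete)  # set lookup keeps the per-line check cheap
    tempname = filename + ".fixed"
    # newline="" leaves the original line endings untouched on read and write.
    # errors="ignore" silently drops undecodable bytes, as in the question.
    with open(filename, "r", encoding=encoding, errors="ignore", newline="") as inputfile, \
         open(tempname, "w", encoding=encoding, newline="") as outputfile:
        for number, line in enumerate(inputfile, 1):
            if number not in rows:
                outputfile.write(line)
    # os.replace overwrites an existing destination on Windows and POSIX alike.
    os.replace(tempname, filename)
```

For example, delete_lines(filename, [101068, 250000]) would drop both lines while streaming through the file once. You need enough free disk space for a second copy of the file while it runs.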

Upvotes: 2
