python - How to efficiently delete specified string in a big file?

Question

Hi, I have a file with this data structure.

For every 3073 bytes:
<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>
The label is between 0 and 9.

Now I need to write a Python script that reads the file and checks every 3073-byte record. If the label is "1", the whole record (label and pixels) should be deleted.

Example: 2 <1st 3072 bytes> 1 <2nd 3072 bytes> 9 <3rd 3072 bytes>....
After running the script:
    output:  2 <1st 3072 bytes> 9 <3rd 3072 bytes>....

My current solution is:

1. Loop over the file in 3073-byte records
   if the label is 1:
       then store the record's index in a buffer
2. Create a new file
   loop over the 3073-byte records again
   if the record's index is in the buffer:
       then skip it
   otherwise:
       write it to the new file

But I found this to be very inefficient. Is there a smarter solution?

Eric Duminil · Accepted Answer

This should be reasonably fast (a few seconds at most for a 150 MB file) and never holds more than one 3073-byte record in memory:

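chunk_size = 3072  # pixel bytes per record (each record = 1 label byte + pixels)

with open('newpixels.bin', 'wb') as new_file:
    with open('pixels.bin', 'rb') as data:
        while True:
            label_and_pixels = data.read(1 + chunk_size)
            if not label_and_pixels:
                break  # end of file
            # Compare a 1-byte slice: on Python 3, label_and_pixels[0] is an int
            # and would never equal '1'. This assumes the label is the ASCII
            # character '1'; use b'\x01' if it is stored as the raw byte value 1.
            if label_and_pixels[:1] != b'1':
                new_file.write(label_and_pixels)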

With pixels.bin as input:

1XXX2YYY2ZZZ3AAA1BBB2CCC

and chunk_size set to 3, it outputs:

2YYY2ZZZ3AAA2CCC
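
The toy example above can be reproduced end to end with a sketch like the one below. The names drop_records and demo, and their parameters, are hypothetical and not part of the original answer. The sketch assumes each record is one label byte followed by a fixed number of pixel bytes, and lets the caller choose whether the label to drop is the ASCII character (b'1') or the raw byte value (b'\x01'), since the question does not say which encoding the file uses.

```python
import os
import tempfile


def drop_records(src_path, dst_path, label=b'1', pixel_count=3072):
    """Copy src_path to dst_path, skipping records whose first byte equals label.

    Hypothetical helper: each record is 1 label byte followed by pixel_count
    pixel bytes. Returns the number of records that were skipped.
    """
    record_size = 1 + pixel_count
    skipped = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            record = src.read(record_size)
            if not record:
                break  # end of file
            if record[:1] == label:
                skipped += 1  # drop this record
            else:
                dst.write(record)
    return skipped


def demo():
    """Run the toy input from the answer with 3 'pixel' bytes per record."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'pixels.bin')
        dst = os.path.join(tmp, 'newpixels.bin')
        with open(src, 'wb') as f:
            f.write(b'1XXX2YYY2ZZZ3AAA1BBB2CCC')
        skipped = drop_records(src, dst, label=b'1', pixel_count=3)
        with open(dst, 'rb') as f:
            return skipped, f.read()


if __name__ == '__main__':
    print(demo())
```

Only one record is held in memory at a time, so memory use does not depend on the file size.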

If you're sure that the algorithm is correct and the output data is fine, you could delete 'pixels.bin' and rename 'newpixels.bin' to 'pixels.bin' at the end of your script.
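
That final swap can also be done from Python. A minimal sketch, assuming Python 3.3 or later for os.replace, which renames the new file over the old one in a single call so no separate delete is needed. The helper name swap_in is hypothetical; the default file names are the ones used in the script above.

```python
import os


def swap_in(new_path='newpixels.bin', original_path='pixels.bin'):
    """Replace original_path with new_path (hypothetical helper).

    os.replace overwrites the destination if it already exists, so the old
    file is dropped and the new one renamed in one step. Both paths should
    live on the same filesystem.
    """
    if not os.path.exists(new_path):
        raise FileNotFoundError(new_path)
    os.replace(new_path, original_path)
```

Call it only after the filtered output has been checked, because the original data is gone afterwards.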
