Reputation: 707
Hi, I have a file with the following data structure.
Each record is 3073 bytes:
<1 x label><3072 x pixel>
...
<1 x label><3072 x pixel>
The label is between 0 and 9.
Now I need to write a Python script that reads the file and checks every 3073-byte record. If the label is "1", the script should delete that whole record (label and pixels).
Example: 2 <1st 3072 bytes> 1 <2nd 3072 bytes> 9 <3rd 3072 bytes> ...
After running the script:
Output: 2 <1st 3072 bytes> 9 <3rd 3072 bytes> ...
My current solution is:
1. Loop over the file in 3073-byte records.
   If the label is 1:
       add the record's index to a buffer
2. Create a new file.
   Loop over the 3073-byte records again.
   If the record's index is in the buffer:
       skip it
But I found this to be very inefficient. Is there a smarter solution?
Upvotes: 0
Views: 59
Reputation: 54293
This should be reasonably fast (a few seconds at most for a 150 MB file) and never holds more than one record in memory:
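    chunk_size = 3072  # pixel bytes per record; each record also has a 1-byte label

    with open('newpixels.bin', 'wb') as new_file:
        with open('pixels.bin', 'rb') as data:
            while True:
                # Read one record: 1 label byte followed by chunk_size pixel bytes.
                label_and_pixels = data.read(1 + chunk_size)
                if not label_and_pixels:
                    break  # end of file
                # Slice instead of indexing: on Python 3, label_and_pixels[0] is an
                # int and would never equal '1'. This assumes an ASCII '1' label;
                # if the label is stored as the raw byte value 1, compare with b'\x01'.
                if label_and_pixels[:1] != b'1':
                    new_file.write(label_and_pixels)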
With 'pixels.bin' containing:
    1XXX2YYY2ZZZ3AAA1BBB2CCC
and chunk_size set to 3, it outputs:
    2YYY2ZZZ3AAA2CCC
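To check this small example without touching the disk, the same loop can be wrapped in a function that accepts any binary file objects. This is only a sketch: the name filter_records and its parameters are made up for illustration, and the label is assumed to be the ASCII character '1', as in the example above.

```python
import io


def filter_records(src, dst, record_size, unwanted_label):
    """Copy fixed-size records from src to dst, skipping every record whose
    first byte equals unwanted_label (a bytes object of length 1).
    Returns the number of records skipped."""
    skipped = 0
    while True:
        record = src.read(record_size)
        if not record:
            break  # end of input
        if record[:1] == unwanted_label:
            skipped += 1
        else:
            dst.write(record)
    return skipped


# Reproduce the example: 1 label byte + 3 "pixel" bytes per record.
source = io.BytesIO(b'1XXX2YYY2ZZZ3AAA1BBB2CCC')
result = io.BytesIO()
skipped = filter_records(source, result, record_size=1 + 3, unwanted_label=b'1')
print(result.getvalue())  # b'2YYY2ZZZ3AAA2CCC' on Python 3
print(skipped)            # 2
```

For the real file, src and dst would be the two open() handles and record_size would be 1 + 3072.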
If you're sure that the algorithm is correct and the output data is fine, you could delete 'pixels.bin' and rename 'newpixels.bin' to 'pixels.bin' at the end of your script.
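The delete-and-rename step can be done with a single call. A minimal sketch, assuming Python 3.3 or later (for os.replace) and that both files are on the same filesystem; the helper name swap_in is hypothetical, and the demonstration runs in a throwaway directory so no real data is touched.

```python
import os
import tempfile


def swap_in(new_path, original_path):
    """Replace original_path with new_path.

    os.replace overwrites the destination if it already exists,
    so no separate delete step is needed.
    """
    os.replace(new_path, original_path)


# Demonstration with small stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'pixels.bin')
    filtered = os.path.join(tmp, 'newpixels.bin')
    with open(original, 'wb') as f:
        f.write(b'1XXX2YYY')
    with open(filtered, 'wb') as f:
        f.write(b'2YYY')

    swap_in(filtered, original)

    with open(original, 'rb') as f:
        final_contents = f.read()
    leftover = os.path.exists(filtered)
```

Both files should be closed before the swap, which is already the case once the with blocks that wrote them have ended.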
Upvotes: 1
Reputation: 537
The following algorithm might be a bit better:
1. Loop over the file in 3073-byte records.
   If the label is 1:
       continue
   else:
       write the record to a new file (?)
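If the file fits comfortably in memory, the same single pass can also be written over one bytes object instead of a read loop. This is a sketch under assumptions, not a tested drop-in: the function name drop_label is made up, and the label is passed as a one-byte bytes object so that it works whether the file stores an ASCII digit or a raw byte value.

```python
def drop_label(data, record_size, label):
    """Return data with every record whose first byte equals label removed.

    data is the whole file as bytes; label is a bytes object of length 1,
    for example b'1' for an ASCII digit or the single byte with value 1.
    """
    return b''.join(
        data[start:start + record_size]
        for start in range(0, len(data), record_size)
        if data[start:start + 1] != label
    )


# Two-byte payloads keep the example short; real records are 1 + 3072 bytes.
# Here the labels are assumed to be raw byte values (2, 1, 9).
sample = b'\x02ab\x01cd\x09ef'
cleaned = drop_label(sample, record_size=3, label=b'\x01')
```

For the real file this would be called on the result of reading 'pixels.bin' in 'rb' mode, with record_size=3073, and the returned bytes written to the new file.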
Upvotes: 0