Reputation: 31262
I have a file that uses \x01
as the line terminator. That is, the line terminator is NOT a newline but the byte value 001
(its ASCII caret notation is ^A
).
I want to split the file into pieces of 10 MB each. Here is what I came up with:
size = 10 * 1024 * 1024  # 10 MB (note: 10000 bytes is only ~10 KB)
i = 0
with open("in-file", "rb") as ifile:
    ofile = open("output0.txt", "wb")
    data = ifile.read(size)
    while data:
        ofile.write(data)
        ofile.close()
        data = ifile.read(size)
        i += 1
        ofile = open("output%d.txt" % i, "wb")
    ofile.close()
However, this results in files that are broken at arbitrary places.
I want each file to end only at a byte of value 001,
with the next read resuming from the following byte.
Upvotes: 0
Views: 226
Reputation: 114038
If it's just a one-byte terminator, you can do something like:
def read_line(f_object, terminal_byte):
    # read one byte at a time; iter()'s two-argument form stops as soon as
    # read(1) returns the sentinel, so the terminator itself is dropped.
    # `or terminal_byte` makes EOF (b"") look like the terminator, so the
    # loop stops at end of file instead of spinning forever
    return b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))
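To see the two-argument (sentinel) form of iter in action, here is a quick check against an in-memory stream (io.BytesIO stands in for a real file; the `or terminal_byte` guard is needed so the loop stops at EOF, where read(1) returns b""):

```python
import io

def read_line(f_object, terminal_byte):
    # stop when read(1) returns the terminator; treat EOF (b"") as a terminator too
    return b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))

buf = io.BytesIO(b"first\x01second\x01")
print(read_line(buf, b"\x01"))  # b'first'
print(read_line(buf, b"\x01"))  # b'second'
print(read_line(buf, b"\x01"))  # b'' once the stream is exhausted
```

Note that the terminator byte itself is consumed and not included in the returned line.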
Then make a helper function that reads all the lines in a file:
def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)
Then make a function that chunks it up:
def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            # emit the chunk once it exceeds max_size, always on a line boundary
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)
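A quick way to sanity-check the chunking end to end, with a tiny max_size and an in-memory stream (the data and sizes here are just for illustration; read_line carries the `or terminal_byte` EOF guard so it stops cleanly at end of stream):

```python
import io

def read_line(f_object, terminal_byte):
    # treat EOF (b"") like the terminator so the sentinel loop stops
    return b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object, terminal_byte)

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)

data = io.BytesIO(b"aaaa\x01bb\x01cccc\x01d\x01")
print(list(make_chunks(data, b"\x01", 5)))  # [b'aaaabb', b'ccccd']
```

Note that every chunk ends on a line boundary, but the \x01 terminators themselves are stripped from the output; if you need them preserved, re-append terminal_byte to each line.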
then just do something like
with open("my_binary.dat", "rb") as f_in:
    for i, chunk in enumerate(make_chunks(f_in, b"\x01", 1024 * 1000 * 10)):
        with open("out%d.dat" % i, "wb") as f_out:
            f_out.write(chunk)
There might be some way to do this with libraries (or even an awesome builtin way) but I'm not aware of any offhand.
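Reading one byte at a time is slow on files this size. A block-based sketch of the same idea (my own variant, not from the answer above: function name, file names, and sizes are illustrative) reads max_size bytes at once and then only extends to the next terminator, so each chunk ends exactly on a \x01 boundary and the terminators are kept in the output:

```python
import io

def iter_chunks(ifile, terminal_byte=b"\x01", max_size=10 * 1024 * 1024):
    # read a big block, then extend byte by byte until the next terminator
    # (or EOF), so every chunk ends exactly on a terminator boundary
    while True:
        data = ifile.read(max_size)
        if not data:
            return
        while not data.endswith(terminal_byte):
            b = ifile.read(1)
            if not b:
                break  # EOF: last chunk may lack a trailing terminator
            data += b
        yield data

sample = io.BytesIO(b"aa\x01bbbb\x01c\x01")
print(list(iter_chunks(sample, max_size=3)))  # [b'aa\x01', b'bbbb\x01', b'c\x01']
```

You would then write each chunk to "out%d.dat" files exactly as in the loop above.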
Upvotes: 1