kaminsknator

Reputation: 1183

Strip header from binary file

I have a raw binary file that is several gigabytes in size, and I'm trying to process it in chunks. Before I can start processing the data I have to remove its header. None of the string methods, such as .find or checking whether a string is in a data chunk, work because of the raw binary format. I would like to strip the header automatically, but it can vary in length, and my current approach of looking for the last newline character didn't work because the raw binary data also contains bytes that match it.

Data format:
BEGIN_HEADER\r\n
header of various line count\r\n
HEADER_END\r\n raw data starts here

How I'm reading in the file:

filename="binary_filename"
chunksize=1024
with open(filename, "rb") as f:
    chunk = f.read(chunksize)
    for index, byte in enumerate(chunk):
        if byte == ord('\n'):
            print("found one " + str(index))

Is there a simple way to locate the HEADER_END\r\n line without sliding a byte array through the file? Current approach:

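chunk = f.read(chunksize)
marker = b'HEADER_END\r\n'
index = 0
not_found = True
# stop at the end of the chunk so the loop also ends when the marker is absent
while not_found and index + len(marker) <= len(chunk):
    if chunk[index:index + len(marker)] == marker:
        print("found")
        not_found = False
    index += 1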

Upvotes: 0

Views: 3272

Answers (1)

Stram

Reputation: 826

You could use linecache:

import linecache

# linecache numbers lines from 1 and returns '' for a line it cannot read
currentline = 1
while linecache.getline("file.bin", currentline) != "HEADER_END\n":
    currentline += 1

# collect everything after the HEADER_END line
currentline += 1
rawdata = ""
currentrawdata = linecache.getline("file.bin", currentline)
while currentrawdata:
    rawdata += currentrawdata
    currentline += 1
    currentrawdata = linecache.getline("file.bin", currentline)
print(rawdata)
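
The same line-by-line idea can also be sketched with the file opened in binary mode, so the raw data is never decoded as text and nothing is cached. This is only a sketch: skip_header is a hypothetical helper name, the marker is taken from the data format in the question, and an in-memory BytesIO stands in for the real file.

```python
import io

END_TAG = b"HEADER_END\r\n"  # marker from the question's data format


def skip_header(f, end_tag=END_TAG):
    """Advance f just past the end_tag line and return that byte offset.

    Hypothetical helper; f must be a file object opened in binary mode.
    """
    while True:
        line = f.readline()
        if not line:
            # reached end of file without seeing the marker line
            raise ValueError("end-of-header line not found")
        if line == end_tag:
            return f.tell()


# Small in-memory stand-in for open(filename, "rb")
sample = io.BytesIO(b"BEGIN_HEADER\r\nkey=value\r\nHEADER_END\r\n\x00\x01\n\xff")
data_offset = skip_header(sample)
raw = sample.read()  # everything after the header
```

After skip_header returns, the file position already sits on the first byte of raw data, so the usual f.read(chunksize) loop can start from there.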

UPDATE

We can split the problem in two: first remove the header, then read the remaining data in chunks:

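# binary mode keeps the raw bytes intact (no decoding, no newline translation)
with open('test_file.bin', 'rb') as src:
    lines = src.readlines()
currentline = 0
while lines[currentline] != b"HEADER_END\r\n":
    currentline += 1
# write everything after the HEADER_END line
with open('newfile.bin', 'wb') as dst:
    dst.writelines(lines[currentline + 1:])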

A file (newfile.bin) will be created containing just the raw data. Now it can be read directly in chunks:

chunksize=1024
with open('newfile.bin', "rb") as f:
    chunk = f.read(chunksize)
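
The snippet above reads only the first chunk. To walk through the whole file, one option is a small generator that keeps reading until read() returns an empty bytes object. This is a sketch: iter_chunks is a hypothetical name, and the BytesIO object stands in for open('newfile.bin', 'rb').

```python
import io


def iter_chunks(f, chunksize=1024):
    """Yield successive chunks from a binary file object until EOF."""
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        yield chunk


# In-memory stand-in for open('newfile.bin', 'rb')
with io.BytesIO(b"\x00" * 2500) as f:
    sizes = [len(chunk) for chunk in iter_chunks(f)]
```

Only one chunk is held in memory at a time, and the last chunk may be shorter than chunksize.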

UPDATE 2

It is also possible to do this without using the intermediate file:

# defines the size of the chunks
chunksize = 20
filename = 'test_file.bin'
endHeaderTag = b"HEADER_END\r\n"
# Identifies at which line there is HEADER_END (binary mode keeps the bytes intact)
with open(filename, 'rb') as f:
    lines = f.readlines()
currentline = 0
while lines[currentline] != endHeaderTag:
    currentline += 1
currentline += 1
# Now currentline contains the index of the first line of the raw data

# Join the lines after the header into a single bytes object
header_stripped = b"".join(lines[currentline:])

# Lastly we slice out successive chunks and store them in the chunks list;
# the final chunk may be shorter than chunksize.
chunks = []
for start in range(0, len(header_stripped), chunksize):
    chunks.append(header_stripped[start:start + chunksize])
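
The code above loads the whole file into memory, which may not be practical for a file of several gigabytes. A possible alternative, which is not part of the original approach, is to search for the marker chunk by chunk. bytes.find does work on binary data as long as the needle is itself a bytes literal. In this sketch, find_data_start is a hypothetical helper, and it keeps the last len(marker) - 1 bytes of each read so that a marker split across two chunks is still found. It assumes the marker bytes do not occur earlier inside the header.

```python
import io


def find_data_start(f, marker=b"HEADER_END\r\n", chunksize=1024):
    """Return the offset of the first byte after marker, or -1 if absent.

    Hypothetical helper; f must be opened in binary mode, positioned at 0.
    """
    overlap = len(marker) - 1
    tail = b""      # trailing bytes carried over from earlier reads
    consumed = 0    # number of bytes read from f before the current chunk
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            return -1
        window = tail + chunk
        pos = window.find(marker)
        if pos != -1:
            # window begins at file offset consumed - len(tail)
            return consumed - len(tail) + pos + len(marker)
        consumed += len(chunk)
        tail = window[-overlap:] if overlap else b""


# Tiny chunks force the marker to straddle several reads
sample = io.BytesIO(b"BEGIN_HEADER\r\nkey=value\r\nHEADER_END\r\n\x00\x01\n\xff")
offset = find_data_start(sample, chunksize=8)
sample.seek(offset)
raw = sample.read()
```

Once the offset is known, f.seek(offset) positions the file at the raw data and the normal chunked read can begin.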

Upvotes: 1
