Reputation: 672
I have a large tar file containing millions of files. For efficiency reasons I don't want to untar the files to disk.
Rather, given a desired filename, I would like to write a script (e.g. in Python) to pull the relevant chunk of data out of the tar file.
Is there an easy way to create an index giving the starting byte and length of every file in the tar archive, which I could dump to disk and use in the aforementioned Python script?
Maybe the tar command can do this, but I'm not seeing anything obvious in the man page.
The tar is not compressed.
Thanks in advance.
Upvotes: 4
Views: 2619
Reputation: 672
For the benefit of others with a similar use case (i.e. wanting to build an index enabling random access on a tar file): in the end I adapted a handy utility at http://fomori.org/blog/?p=391 the essence of which is (in Python):
import tarfile

fp = open('index.txt', 'w')
ctr = 0
with tarfile.open(tarfname, 'r') as db:
    for tarinfo in db:
        rec = "%d\t%d\t%d\t%s\n" % (ctr, tarinfo.offset_data, tarinfo.size, tarinfo.name)
        fp.write(rec)
        ctr += 1
        if ctr % 1000 == 0:
            db.members = []  # drop cached TarInfo objects to keep memory flat
fp.close()
Resetting db.members every 1000 entries conserves RAM, since tarfile otherwise caches a TarInfo object for every member it has seen. I'm sure this could be neater.
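Once the index exists, pulling a single member out of the (uncompressed) tar is just a seek and a read. A minimal sketch, where `read_member` is a hypothetical helper name and the offset/size are the values recorded in the index above:

```python
def read_member(tarfname, offset, size):
    # Read one member's bytes directly from an uncompressed tar,
    # using the data offset and size recorded in the index.
    with open(tarfname, 'rb') as f:
        f.seek(offset)       # jump straight to the member's data
        return f.read(size)  # read exactly the member's bytes
```

This avoids opening the archive with tarfile at lookup time, which is the whole point of the index.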
Upvotes: 4
Reputation: 1713
The Python code does not perform very well on a big tar file. I use the awk script below instead; it works off tar's verbose listing with block numbers (GNU tar's -R flag):
tar -tvf <tar-file> -R | awk '
BEGIN {
    # Read the first entry and remember its name ($8) and size ($5).
    getline;
    f = $8;
    s = $5;
}
{
    # $2 is the 512-byte block number of this entry header (with a
    # trailing colon, which int() strips). The previous entry data
    # ends just before this header, padded up to a 512-byte boundary,
    # so its data offset is the header offset minus the padded size.
    offset = int($2) * 512 - int((s + 511) / 512) * 512;
    print offset, s, f;
    f = $8;
    s = $5;
}'
Upvotes: 4