Reputation: 672
I have a large tar file containing millions of files. For efficiency reasons I don't want to untar the files to disk.
Rather, given a desired filename, I would like to write a script (e.g. in Python) to pull the relevant chunk of data out of the tar file.
Is there an easy way to create an index giving the starting byte and length of every file in the tar archive, which I could dump to disk and use in the aforementioned Python script?
Maybe the tar command can do this, but I'm not seeing anything obvious in the man page.
The tar is not compressed.
Thanks in advance.
Upvotes: 4
Views: 2619
Reputation: 672
For the benefit of others with a similar use case (i.e. wanting to build an index enabling random access on a tar file): in the end I adapted a handy utility at http://fomori.org/blog/?p=391 the essence of which is (in Python):
import tarfile

fp = open('index.txt', 'w')
ctr = 0
with tarfile.open(tarfname, 'r') as db:
    for tarinfo in db:
        rec = "%d\t%d\t%d\t%s\n" % (ctr, tarinfo.offset_data, tarinfo.size, tarinfo.name)
        fp.write(rec)
        ctr += 1
        if ctr % 1000 == 0:
            db.members = []  # drop cached TarInfo objects to keep memory flat
fp.close()
Resetting db.members every 1000 entries conserves RAM, since tarfile otherwise caches a TarInfo object for every member it has seen. I'm sure this could be neater.
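Once the index exists, pulling a single member out of the (uncompressed) tar is just a seek and a read. A minimal sketch, where `read_member` is a hypothetical helper name and the offset/size are the values recorded in the index above:

```python
def read_member(tarfname, offset, size):
    # Read one member's bytes directly from an uncompressed tar,
    # using the data offset and size recorded in the index.
    with open(tarfname, 'rb') as f:
        f.seek(offset)       # jump straight to the member's data
        return f.read(size)  # read exactly the member's bytes
```

This avoids opening the archive with tarfile at lookup time, which is the whole point of the index.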
Upvotes: 4
Reputation: 1713
The Python code does not perform very well on a big tar file. I use the awk script below instead; it works off tar's verbose listing with block numbers (GNU tar's -R flag):
tar -tvf <tar-file> -R | awk '
BEGIN {
    # Read the first entry and remember its name ($8) and size ($5).
    getline;
    f = $8;
    s = $5;
}
{
    # $2 is the 512-byte block number of this entry header (with a
    # trailing colon, which int() strips). The previous entry data
    # ends just before this header, padded up to a 512-byte boundary,
    # so its data offset is the header offset minus the padded size.
    offset = int($2) * 512 - int((s + 511) / 512) * 512;
    print offset, s, f;
    f = $8;
    s = $5;
}'
Upvotes: 4