Reputation: 3806
I am trying to grab a single file from a tar archive. Using the tarfile library, I can find the files with the right extension in the member list, following the library's own example:
def xml_member_files(self, members):
    # Yield only the members whose extension is .xml
    for tarinfo in members:
        if os.path.splitext(tarinfo.name)[1] == ".xml":
            yield tarinfo

member_file = self.xml_member_files(tar)
for m in member_file:
    print(m.name)
This is great and the output is:
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutBeta.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutGamma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/lutSigma.xml
RS2_C0RS2_OK67683_PK618800_DK549742_SLA23_20151006_234046_HH_SLC/product.xml
If I look for just product.xml, it doesn't work. This is what I tried:
ti = tar.getmember('product.xml')
print(ti.name)
It doesn't find product.xml, I am guessing because of the path information in front of the file name. I have no idea how to retrieve just that path so I can get at product.xml once it is extracted (it feels like I am doing things the hard way anyway). How do I figure out that path so I can join it with my other file functions to read and load the XML file, after extracting only that one file from the tar?
Upvotes: 1
Views: 2100
Reputation: 675
You don't want to be iterating over the entire tar with getnames(), getmember() or getmembers(), because as soon as you find your file, you don't need to keep looking through the rest of the tar.
For example, it takes my machine about 47 ms to extract a single file from a 2 GB tar by iterating over all the file names:
import tarfile

with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    # getnames() reads every header in the archive before filtering
    membername = [x for x in tar.getnames() if x.endswith('myfile.txt')][0]
    file = tar.extractfile(membername).read().decode()
But stopping as soon as the file is found takes me only 0.27 ms, nearly 175x faster.
file = None
with tarfile.open('/tmp/2GB-file.tar', mode='r:') as tar:
    for member in tar:
        if member.name.endswith('myfile.txt'):
            file = tar.extractfile(member).read().decode()
            break
Note that if the file you need is near the end of the archive, you probably won't notice much of a change in speed, but it is still good practice not to loop through the whole file if you don't have to.
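The early-exit loop above could be wrapped in a small helper. This is only a sketch: the function name read_first_match and the choice to match on the exact base name (rather than endswith) are my own assumptions, not something from the question.

```python
import os
import tarfile


def read_first_match(tar_path, basename):
    """Return the bytes of the first regular file whose base name matches, or None.

    Hypothetical helper: the name and the exact-basename match are assumptions.
    """
    with tarfile.open(tar_path, mode="r:*") as tar:
        # Iterating over the TarFile reads one header at a time,
        # so returning here stops before the rest of the archive is scanned.
        for member in tar:
            if member.isfile() and os.path.basename(member.name) == basename:
                return tar.extractfile(member).read()
    return None
```

Calling read_first_match('/tmp/2GB-file.tar', 'myfile.txt') would then give you the raw bytes, which you can decode() if the file is text.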
Upvotes: 1
Reputation: 4551
You can get the full path by iterating over the result of getnames(). For example, to get the full path for lutBeta.xml:
import os
import tarfile

tar = tarfile.open('mytarfile.tar')
membername = [x for x in tar.getnames() if os.path.basename(x) == 'lutBeta.xml'][0]
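Once you have the full member name, you can extract only that member and build the on-disk path from it. A minimal sketch, assuming you want the file written under a directory of your choosing; extract_by_basename and dest_dir are names I made up for illustration.

```python
import os
import tarfile


def extract_by_basename(tar_path, basename, dest_dir):
    """Extract the first member named `basename` and return its path on disk.

    Hypothetical helper: the function and parameter names are assumptions.
    """
    with tarfile.open(tar_path) as tar:
        matches = [n for n in tar.getnames() if os.path.basename(n) == basename]
        if not matches:
            raise KeyError(basename)
        membername = matches[0]
        # Only this one member is written; its directory prefix is recreated
        # under dest_dir.
        tar.extract(membername, path=dest_dir)
    return os.path.join(dest_dir, membername)
```

The returned path is the piece the question was missing: it is dest_dir joined with the member's full name inside the archive, so it can be passed straight to whatever reads the XML. If you only need the directory prefix, os.path.dirname(membername) gives it.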
Upvotes: 3
Reputation: 1579
I would first try TarFile.getnames(), which I imagine works a lot like tar tzf filename.tar.gz from the command line. Then you can find out what path to feed to getmember().
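Roughly, that could look like the sketch below. The helper name find_member is hypothetical; the point is only that getmember() wants the full name as stored in the archive, which is why the bare 'product.xml' lookup in the question raises KeyError.

```python
import os
import tarfile


def find_member(tar, basename):
    """Return the TarInfo whose base name matches, looked up by its full name.

    Hypothetical helper: `tar` is an already opened tarfile.TarFile.
    """
    for name in tar.getnames():
        if os.path.basename(name) == basename:
            # getmember() needs the full path as stored in the archive
            return tar.getmember(name)
    raise KeyError(basename)
```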
Upvotes: 1