tgpz

Reputation: 181

What's the best way to read a text file from within a tar archive as strings (not byte strings)?

I have some text files inside a tar archive that I need to process without extracting them. I have working Python 2 code that I am trying to port to Python 3. Unfortunately, Python 3 returns byte strings, which the rest of the code cannot process correctly, so I need to convert them to strings. A simple example looks like this:

import tarfile
with tarfile.open("file.tar") as tar:
    with tar.extractfile("test.txt") as extracted:
        lines = extracted.readlines()
        print(lines)

The result is:

['a\n', 'test\n', 'file\n']    # python 2
[b'a\n', b'test\n', b'file\n'] # python 3

Below are my current attempts at a fix. They all work, but it feels awkward to need a triple with statement, a list comprehension, or map just to read some text:

with io.TextIOWrapper(extracted) as txtextracted:
    lines = txtextracted.readlines()

# or
lines = [i.decode("utf-8") for i in lines]

# or
lines = list(map(lambda x: x.decode("utf-8"),lines))

I cannot find a neater solution in the io.BufferedReader documentation (this is the type of object that TarFile.extractfile returns). I have tried to come up with alternatives, but none are as neat as the Python 2 version. Is there a neat, Pythonic way to read the tar file's io.BufferedReader object as strings?

Upvotes: 0

Views: 1355

Answers (1)

Karl Knechtel

Reputation: 61527

A single with statement accepts multiple context managers, and as it turns out, the construction of each one may depend on the ones before it in the chain. For example:

class manager:
    def __init__(self, name, child=None):
        self.name, self.child = name, child
    def __exit__(self, t, value, traceback):
        print('exiting', self)
    def __enter__(self):
        print('entering', self)
        return self
    def __str__(self):
        childname = None if self.child is None else f"'{self.child.name}'"
        return f"manager '{self.name}' with child {childname}"

Testing it:

>>> with manager('x') as x, manager('y', x) as y, manager('z', y) as z: pass
...
entering manager 'x' with child None
entering manager 'y' with child 'x'
entering manager 'z' with child 'y'
exiting manager 'z' with child 'y'
exiting manager 'y' with child 'x'
exiting manager 'x' with child None
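The ordering above is the same one you get from nested with statements, because the comma-separated form is processed as if the statements were nested. Here is a minimal sketch of that equivalence; `tracked` and `events` are hypothetical names introduced only for this illustration and stand in for the manager class above:

```python
from contextlib import contextmanager

events = []  # records the order of enter/exit calls


@contextmanager
def tracked(name, child=None):
    # Hypothetical stand-in for the manager class above.
    events.append(f"enter {name}")
    try:
        yield name
    finally:
        events.append(f"exit {name}")


# Comma-separated form: a later manager can use a name bound by an earlier one.
with tracked("x") as x, tracked("y", x) as y:
    pass
flat = list(events)
events.clear()

# Nested form: the managers are entered and exited in the same order.
with tracked("x") as x:
    with tracked("y", x) as y:
        pass
nested = list(events)

print(flat == nested)
```

So the one-line form only saves indentation; it does not change when each manager is entered or exited.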

Thus:

import io
import tarfile
with tarfile.open("file.tar") as tar, tar.extractfile("test.txt") as binary, io.TextIOWrapper(binary, encoding="utf-8") as text:
    lines = text.readlines()

(Although I don't think you really need to manage all those contexts anyway...)
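One way to trim the chain, as a sketch rather than a definitive recipe: closing an io.TextIOWrapper also closes the binary stream it wraps, so the middle context can arguably be dropped. The helper name `read_text_lines` and its parameters are assumptions made up for this example, and UTF-8 is assumed as the encoding because the question decodes with it:

```python
import io
import tarfile


def read_text_lines(tar_path, member, encoding="utf-8"):
    """Return the lines of a text member of a tar archive as str objects.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    with tarfile.open(tar_path) as tar:
        binary = tar.extractfile(member)
        if binary is None:
            # extractfile returns None for members that are not
            # regular files or links (for example, directories).
            raise ValueError(f"{member!r} is not a regular file")
        # Closing the text wrapper also closes the wrapped binary stream.
        with io.TextIOWrapper(binary, encoding=encoding) as text:
            return text.readlines()
```

Calling `read_text_lines("file.tar", "test.txt")` on the archive from the question should then give a list of str lines, and the rest of the code no longer has to deal with the nesting.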

Upvotes: 3
