fodon

Reputation: 4645

python subprocess with gzip

I am trying to stream data through a subprocess, gzip it and write to a file. The following works. I wonder if it is possible to use python's native gzip library instead.

fid = gzip.open(self.ipFile, 'rb') # input data
oFid = open(filtSortFile, 'wb') # output file
sort = subprocess.Popen(args="sort | gzip -c ", shell=True, stdin=subprocess.PIPE, stdout=oFid) # set up the pipe
processlines(fid, sort.stdin, filtFid) # pump data into the pipe

THE QUESTION: How do I do this with Python's gzip package instead? I'm mostly curious why the following gives me a text file (instead of a compressed binary version) ... very odd.

fid = gzip.open(self.ipFile, 'rb')
oFid = gzip.open(filtSortFile, 'wb')
sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=oFid)
processlines(fid, sort.stdin, filtFid)

Upvotes: 5

Views: 6451

Answers (3)

David Streuli

Reputation: 81

Yes, it is possible to use python's native gzip library instead. I recommend looking at this question: gzip a file in Python.

I'm now using Jace Browning's answer:

with open('path/to/file', 'rb') as src, gzip.open('path/to/file.gz', 'wb') as dst:
    dst.writelines(src)

Although one comment says you have to convert the src content to bytes, that is not required with this code, because src is opened in binary mode and already yields bytes.

Upvotes: 0

jfs

Reputation: 414179

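subprocess writes directly to the file descriptor returned by oFid.fileno(), but for a gzip file that is the descriptor of the underlying (uncompressed) file object, so the data bypasses compression: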

def fileno(self):
    """Invoke the underlying file object's fileno() method."""
    return self.fileobj.fileno()
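A quick way to see this for yourself is the check below. It is only a sketch: the helper name gzip_shares_fd is made up for illustration, and it assumes the gzip object is built on top of an ordinary file opened with open().

```python
import gzip

def gzip_shares_fd(path):
    """Return True if a GzipFile reports the same fd as the raw file under it.

    gzip_shares_fd is a hypothetical helper, not part of the gzip module.
    """
    with open(path, 'wb') as raw:
        with gzip.GzipFile(fileobj=raw, mode='wb') as gz:
            # Anything written straight to this descriptor (as a child
            # process does) never passes through gz.write(), so it is
            # never compressed.
            return gz.fileno() == raw.fileno()
```

If the two descriptors match, a child process given the gzip object as stdout writes plain bytes into the raw file.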

To enable compression use gzip methods directly:

import gzip
from subprocess import Popen, PIPE
from threading import Thread

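def f(input, output):
    # Copy sort's output into the gzip stream; closing it flushes the trailer.
    with output:
        for line in iter(input.readline, b''):  # pipes yield bytes; b'' means EOF
            output.write(line)

p = Popen(["sort"], bufsize=-1, stdin=PIPE, stdout=PIPE)
Thread(target=f, args=(p.stdout, gzip.open('out.gz', 'wb'))).start()

for s in "cafebabe":
    p.stdin.write((s + "\n").encode())  # the pipe expects bytes on Python 3
p.stdin.close()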

Example

$ python gzip_subprocess.py  && od -c out.gz && zcat out.gz 
0000000 037 213  \b  \b 251   E   t   N 002 377   o   u   t  \0   K 344
0000020   J 344   J 002 302   d 256   T       L 343 002  \0   j 017   j
0000040   k 020  \0  \0  \0
0000045
a
a
b
b
c
e
e
f

Upvotes: 7

steabert

Reputation: 6878

Since you only hand the file handle to the process you're executing, none of the file object's other methods (such as its compressing write) are involved. To work around this, you could send the output to a pipe and read from that, like so:

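oFid = gzip.open(filtSortFile, 'wb')
sort = subprocess.Popen(args="sort ", shell=True, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
processlines(fid, sort.stdin, filtFid)  # feed the input, as in the question
sort.stdin.close()  # sort prints nothing until it sees EOF on stdin
oFid.writelines(sort.stdout)  # compression happens here, inside Python
oFid.close()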
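The same idea can be wrapped in a small function. This is a rough sketch rather than a drop-in replacement: pipe_to_gzip and its cmd parameter are names invented here, and it assumes the child (like sort) reads all of its input before printing anything, otherwise writing a large input up front could block.

```python
import gzip
import shutil
import subprocess

def pipe_to_gzip(lines, out_path, cmd=("sort",)):
    """Feed byte lines to cmd and gzip-compress whatever it prints.

    pipe_to_gzip is a hypothetical helper. It assumes cmd consumes all of
    stdin before producing output, as sort does.
    """
    proc = subprocess.Popen(list(cmd), stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    proc.stdin.writelines(lines)
    proc.stdin.close()  # signal EOF so the child can start printing
    with gzip.open(out_path, "wb") as dst:
        # Python reads the pipe and writes through gzip, so the data
        # is actually compressed.
        shutil.copyfileobj(proc.stdout, dst)
    proc.stdout.close()
    return proc.wait()
```

For example, pipe_to_gzip([b"c\n", b"a\n", b"b\n"], "out.gz") should leave a real gzip file whose decompressed contents are the sorted lines, provided a sort command is on the PATH.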

Upvotes: 2
