Reputation: 1534
I'm using pydoop to read and write files in pyspark. I want to write my job output in gzip format. My current code looks like the following:
import os
import numpy as np
import pydoop.hdfs as hdfs

def create_data_distributed(workerNum, outputDir, centers, noSamples=10, var=0.1):
    numCenters = centers.shape[0]
    dim = centers.shape[1]
    fptr_out = hdfs.hdfs().open_file(os.path.join(outputDir, "part-%05d" % workerNum) + ".txt", "w")
    for idx in range(noSamples):
        idxCenter = np.random.randint(numCenters)
        sample = centers[idxCenter] + np.random.normal(size=(1, dim))
        # output the sample: center index, then the coordinates
        fptr_out.write("%d, " % idxCenter)
        for i in range(len(sample[0])):
            fptr_out.write("%f " % sample[0][i])
            if i < (len(sample[0]) - 1):
                fptr_out.write(",")
        fptr_out.write("\n")
    fptr_out.close()
    return
How do I make this code open and write a gzip file instead of a regular file?
Thanks!!!
Upvotes: 2
Views: 702
Reputation: 74655
I would expect that you could do that by wrapping the file-like object returned by hdfs.hdfs().open_file(...) with gzip.GzipFile, like this:
hdfs_file = hdfs.hdfs().open_file(...)
fptr_out = gzip.GzipFile(mode='wb', fileobj=hdfs_file)
Note that you have to call close on both:
fptr_out.close()
hdfs_file.close()
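If you stick with explicit close calls, a try/finally sketch along these lines (also untested) keeps the ordering right; the GzipFile must be closed first, because closing it flushes the gzip trailer into the still-open HDFS file:

import gzip
import pydoop.hdfs as hdfs

hdfs_file = hdfs.hdfs().open_file(..., "wb")
try:
    fptr_out = gzip.GzipFile(mode='wb', fileobj=hdfs_file)
    try:
        ...  # write the samples here
    finally:
        # close the GzipFile first so the gzip trailer lands in hdfs_file
        fptr_out.close()
finally:
    hdfs_file.close()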
This is much clearer with the with statement:
output_filename = os.path.join(outputDir, "part-%05d" % workerNum) + ".txt.gz"
with hdfs.hdfs().open_file(output_filename, "wb") as hdfs_file:
    with gzip.GzipFile(mode='wb', fileobj=hdfs_file) as fptr_out:
        ...
This is all untested. Use at your own risk.
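For completeness, the whole function with the gzip wrapping applied might look like the following sketch. It is equally untested, assumes pydoop's HDFS file objects support the with statement, and adds an encode call because on Python 3 GzipFile.write expects bytes:

import os
import gzip
import numpy as np
import pydoop.hdfs as hdfs

def create_data_distributed(workerNum, outputDir, centers, noSamples=10, var=0.1):
    numCenters = centers.shape[0]
    dim = centers.shape[1]
    output_filename = os.path.join(outputDir, "part-%05d" % workerNum) + ".txt.gz"
    with hdfs.hdfs().open_file(output_filename, "wb") as hdfs_file:
        with gzip.GzipFile(mode='wb', fileobj=hdfs_file) as fptr_out:
            for idx in range(noSamples):
                idxCenter = np.random.randint(numCenters)
                sample = centers[idxCenter] + np.random.normal(size=(1, dim))
                # same line format as the original: "<center>, <v0> ,<v1> ..."
                line = "%d, " % idxCenter
                line += ",".join("%f " % v for v in sample[0])
                line += "\n"
                # GzipFile wants bytes on Python 3
                fptr_out.write(line.encode("ascii"))
    return

Both with blocks close their files on exit, so the explicit close calls above are no longer needed.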
Upvotes: 2