dotan

Reputation: 1534

Saving a gzip file with pydoop in python

I'm using pydoop to read and write files in pyspark. I want to write my job output in gzip format. My current code looks like the following:

import os
import numpy as np
import pydoop.hdfs as hdfs

def create_data_distributed(workerNum, outputDir, centers, noSamples=10, var=0.1):
    numCenters = centers.shape[0]
    dim = centers.shape[1]
    fptr_out = hdfs.hdfs().open_file(os.path.join(outputDir, ("part-%05d" % workerNum)) + ".txt", "w")
    for idx in range(noSamples):
        idxCenter = np.random.randint(numCenters)
        sample = centers[idxCenter] + np.random.normal(size=(1, dim))
        # write the sample: center index, then comma-separated coordinates
        fptr_out.write("%d, " % idxCenter)
        for i in range(len(sample[0])):
            fptr_out.write("%f " % (sample[0][i]))
            if i < (len(sample[0]) - 1):
                fptr_out.write(",")
        fptr_out.write("\n")
    fptr_out.close()
    return

How do I change this code to open and write a gzip file instead of a regular file?

Thanks!!!

Upvotes: 2

Views: 702

Answers (1)

Dan D.

Reputation: 74655

I would expect that you could do that by wrapping the file-like object returned by:

fptr_out = hdfs.hdfs().open_file(...)

in a gzip.GzipFile, like this:

import gzip

hdfs_file = hdfs.hdfs().open_file(...)
fptr_out = gzip.GzipFile(mode='wb', fileobj=hdfs_file)

Note that you have to call close on both, the GzipFile first, so that it flushes its buffered, compressed data (including the gzip trailer) before the underlying file is closed:

fptr_out.close()
hdfs_file.close()

This is a lot clearer with the with statement:

output_filename = os.path.join(outputDir, ("part-%05d" % workerNum)) + ".txt.gz"
with hdfs.hdfs().open_file(output_filename, "wb") as hdfs_file:
    with gzip.GzipFile(mode='wb', fileobj=hdfs_file) as fptr_out:
        ...

This is all untested. Use at your own risk.
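
Putting this together with the function from the question, a minimal sketch might look like the following. It assumes pydoop's HDFS file objects support the with statement as used above, and it keeps the question's original formatting logic; on Python 3 the strings would also need to be encoded to bytes before writing to the GzipFile:

import gzip
import os
import numpy as np
import pydoop.hdfs as hdfs

def create_data_distributed(workerNum, outputDir, centers, noSamples=10, var=0.1):
    numCenters = centers.shape[0]
    dim = centers.shape[1]
    output_filename = os.path.join(outputDir, ("part-%05d" % workerNum)) + ".txt.gz"
    # open the HDFS file, then layer gzip compression on top of it
    with hdfs.hdfs().open_file(output_filename, "wb") as hdfs_file:
        with gzip.GzipFile(mode='wb', fileobj=hdfs_file) as fptr_out:
            for idx in range(noSamples):
                idxCenter = np.random.randint(numCenters)
                sample = centers[idxCenter] + np.random.normal(size=(1, dim))
                # write the center index, then the comma-separated coordinates
                fptr_out.write("%d, " % idxCenter)
                for i in range(len(sample[0])):
                    fptr_out.write("%f " % (sample[0][i]))
                    if i < (len(sample[0]) - 1):
                        fptr_out.write(",")
                fptr_out.write("\n")
    return

The nesting of the with blocks guarantees the close order described above: the GzipFile is closed (and the gzip trailer written) before the HDFS file itself is closed.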

Upvotes: 2
