Marigold

Reputation: 1659

Gzip files with Spark

I have a Spark job that takes several thousand files as input, downloads them from Amazon S3 and processes them in the map phase, where each map step returns a string. I'd like to compress the outputs into a single .tar.gz file and upload it to S3 afterwards. One way to do it is

import tarfile
import tempfile

# process_file stands for the per-file S3 download + processing step
outputs = sc.parallelize(filenames).map(process_file).collect()

with tempfile.NamedTemporaryFile() as tar_temp:
    tar = tarfile.open(tar_temp.name, "w:gz")
    for output in outputs:
        with tempfile.NamedTemporaryFile() as output_temp:
            output_temp.write(output)
            output_temp.flush()  # make sure the contents are on disk before tar reads the file
            tar.add(output_temp.name)
    tar.close()

The problem is that the outputs don't fit into memory (though they do fit on disk). Is there a way to save the outputs to the master's filesystem during the map phase? Or perhaps to treat the loop for output in outputs as a generator, so that I don't have to load everything into memory at once?

Upvotes: 2

Views: 1787

Answers (1)

Olivier Girardot

Reputation: 4648

In Spark 1.3.0 you will be able to use the toLocalIterator method, already available in Java/Scala, from Python as well.

The pull request has been merged: https://github.com/apache/spark/pull/4237

Here's the corresponding documentation:

    """
    Return an iterator that contains all of the elements in this RDD.
    The iterator will consume as much memory as the largest partition in this RDD.
    >>> rdd = sc.parallelize(range(10))
    >>> [x for x in rdd.toLocalIterator()]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    """

All in all, it will let you iterate over your outputs without collecting everything to the driver.
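
To make that concrete, here is a minimal sketch of how your tar.gz step could look on top of toLocalIterator. process_file is a hypothetical stand-in for your per-file download-and-process step, and I'm assuming it returns bytes:

import tarfile
import tempfile

# process_file is a placeholder for the per-file S3 download + processing
rdd = sc.parallelize(filenames).map(process_file)

with tempfile.NamedTemporaryFile() as tar_temp:
    with tarfile.open(tar_temp.name, "w:gz") as tar:
        # toLocalIterator pulls one partition at a time to the driver,
        # so only the largest partition has to fit in memory
        for i, output in enumerate(rdd.toLocalIterator()):
            with tempfile.NamedTemporaryFile() as output_temp:
                output_temp.write(output)
                output_temp.flush()  # ensure the bytes are on disk before tar reads the file
                tar.add(output_temp.name, arcname="output_%d" % i)
    # tar_temp.name now holds the .tar.gz and can be uploaded to S3

That way the full set of outputs never sits in driver memory at once; only the current partition plus the file currently being added.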

Regards,

Upvotes: 1
