shuaiyuancn

Reputation: 2794

How do I load a gzip-compressed csv file in pyspark?

The file names don't end with .gz and I cannot change them back as they are shared with other programs.

file1.log.gz.processed is simply a csv file. But how do I read it in pyspark, preferably in pyspark.sql?

I tried to specify the format and compression but couldn't find the correct key/value. E.g.,

sqlContext.load(fn, format='gz')

didn't work. Although Spark can deal with gz files, it seems to determine the codec from file names. E.g.,

sc.textFile(fn)

would work if the file ends with .gz but not in my case.

How do I instruct Spark to use the correct codec? Thank you!

Upvotes: 4

Views: 10616

Answers (1)

Markon

Reputation: 4600

You should not use .load that way; it has been deprecated since version 1.4.0. Use read.format(source).schema(schema).options(options).load() instead:

sql_context.read.format("com.databricks.spark.csv") \
    .options(
        header=...,       # e.g., "true"
        inferSchema=...) \
    .load(file_path + ".gz")
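If you cannot rename the files and the reader still infers the codec from the extension, one workaround is to read the raw bytes and decompress them yourself. Below is a minimal sketch: the decompression helper is plain Python, and the commented-out PySpark usage assumes an existing SparkContext named sc (the file path is the one from the question).

```python
import gzip
import io

def gunzip_text(raw_bytes):
    """Decompress gzip-compressed bytes and return the decoded text."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt") as fh:
        return fh.read()

# Hypothetical PySpark usage (assumes a SparkContext `sc` exists):
# lines = (sc.binaryFiles("file1.log.gz.processed")
#            .mapValues(gunzip_text)                      # (path, text) pairs
#            .flatMap(lambda kv: kv[1].splitlines()))     # one record per csv line
```

From there the decompressed lines can be split on commas and converted to a DataFrame with sqlContext.createDataFrame. Note that sc.binaryFiles loads each file's content into memory as a whole, so this suits many small files better than one very large one.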

Upvotes: 1
