shuaiyuancn

Reputation: 2794

How do I load a gzip-compressed csv file in pyspark?

The file names don't end with .gz and I cannot change them back as they are shared with other programs.

file1.log.gz.processed is simply a csv file. But how do I read it in pyspark, preferably in pyspark.sql?

I tried to specify the format and compression but couldn't find the correct key/value. E.g.,

sqlContext.load(fn, format='gz')

didn't work. Although Spark can deal with gz files, it seems to determine the codec from file names. E.g.,

sc.textFile(fn)

would work if the file ends with .gz but not in my case.

How do I instruct Spark to use the correct codec? Thank you!

Upvotes: 4

Views: 10616

Answers (1)

Markon

Reputation: 4600

You should not use .load that way; it has been deprecated since version 1.4.0. Use read.format(source).schema(schema).options(options).load() instead:

sql_context.read.format("com.databricks.spark.csv") \
    .options(
        header=...,       # e.g., "true"
        inferSchema=...) \
    .load(file_path + ".gz")
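If you cannot rename the files and the reader still infers the codec from the extension, one workaround is to read the raw bytes and decompress them yourself. Below is a minimal sketch: the decompression helper is plain Python, and the commented-out PySpark usage assumes an existing SparkContext named sc (the file path is the one from the question).

```python
import gzip
import io

def gunzip_text(raw_bytes):
    """Decompress gzip-compressed bytes and return the decoded text."""
    with gzip.open(io.BytesIO(raw_bytes), mode="rt") as fh:
        return fh.read()

# Hypothetical PySpark usage (assumes a SparkContext `sc` exists):
# lines = (sc.binaryFiles("file1.log.gz.processed")
#            .mapValues(gunzip_text)                      # (path, text) pairs
#            .flatMap(lambda kv: kv[1].splitlines()))     # one record per csv line
```

From there the decompressed lines can be split on commas and converted to a DataFrame with sqlContext.createDataFrame. Note that sc.binaryFiles loads each file's content into memory as a whole, so this suits many small files better than one very large one.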

Upvotes: 1
