Reputation: 2794
The file names don't end with .gz, and I cannot change them back as they are shared with other programs. file1.log.gz.processed is simply a gzip-compressed CSV file. But how do I read it in pyspark, preferably via pyspark.sql?
I tried to specify the format and compression but couldn't find the correct key/value. E.g.,

sqlContext.load(fn, format='gz')

didn't work. Although Spark can deal with .gz files, it seems to determine the codec from the file name. E.g., sc.textFile(fn) would work if the file ended with .gz, but not in my case.

How do I instruct Spark to use the correct codec? Thank you!
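For what it's worth, the gzip format itself carries no dependence on the file name; only Spark's codec auto-detection keys off the .gz suffix. A quick plain-Python sanity check (no Spark required; the file name is the one from the question, and the CSV contents are made up for the demo):

```python
import gzip
import os

# Simulate the situation: a gzip-compressed CSV whose name
# does not end in .gz (name taken from the question).
data = b"a,b\n1,2\n"
path = "file1.log.gz.processed"
with open(path, "wb") as f:
    f.write(gzip.compress(data))

# gzip.open decompresses fine regardless of the extension.
with gzip.open(path, "rb") as f:
    recovered = f.read()

print(recovered.decode(), end="")  # prints the original CSV rows
os.remove(path)
```

So the file itself is readable; the question is purely how to get Spark to pick the gzip codec despite the suffix.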
Upvotes: 4
Views: 10616
Reputation: 4600
You should not use .load that way; it has been deprecated since version 1.4.0. Use read.format(source).schema(schema).options(options).load() instead:
sql_context.read.format("com.databricks.spark.csv") \
    .options(header="true",       # example values
             inferSchema="true") \
    .load(file_path + ".gz")
Upvotes: 1