Prabhat Ratnala

Reputation: 705

How to read compressed avro files (.gz) in spark?

I am trying to read a gzip-compressed (.gz extension) avro file with Spark, but I am getting the error below. From the documentation I understood that Spark can read .gz files without any extra conversion, but that may only apply to csv/text files.

I tried running the command below, but it fails:

df = spark.read.format("com.databricks.spark.avro").load("/user/data/test1.avro.gz")

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/readwriter.py", line 149, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.io.IOException: Not an Avro data file
        at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:63)
        at com.databricks.spark.avro.DefaultSource$$anonfun$5.apply(DefaultSource.scala:80)
        at com.databricks.spark.avro.DefaultSource$$anonfun$5.apply(DefaultSource.scala:77)
        at scala.Option.getOrElse(Option.scala:121)
        at com.databricks.spark.avro.DefaultSource.inferSchema(DefaultSource.scala:77)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
        at scala.Option.orElse(Option.scala:289)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)

Upvotes: 0

Views: 1532

Answers (1)

mattficke

Reputation: 797

Compression in an avro file works by compressing the individual data blocks separately; the avro file itself is not compressed (docs). ORC and Parquet compression works in a similar way, which is what makes these formats splittable.

In other words, you can't run gzip on an uncompressed .avro file and then read it directly, the way you can with plain text files.
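If your file really is a plain .avro container that was gzipped afterwards, one workaround is to decompress it back into a regular .avro file before handing it to Spark. This is only a sketch with illustrative local paths (an HDFS file would need to be copied locally or decompressed with HDFS tools first), and it assumes an existing `spark` session as in your shell:

```python
import gzip
import shutil

# Decompress the externally-gzipped file back into a plain avro container
# (paths are illustrative; adjust for your environment)
with gzip.open("test1.avro.gz", "rb") as src, open("test1.avro", "wb") as dst:
    shutil.copyfileobj(src, dst)

# The restored file is a normal avro container the reader can open
df = spark.read.format("com.databricks.spark.avro").load("test1.avro")
```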

Compression happens when you write the avro file. In Spark this is controlled either by the spark.sql.avro.compression.codec SparkConf setting or by the compression option on the writer (docs).
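For example, a minimal PySpark sketch using the config-based approach (paths and the choice of "deflate" are illustrative; "deflate" and "snappy" are the codecs avro typically supports):

```python
from pyspark.sql import SparkSession

# Build a session with block-level avro compression enabled
spark = (SparkSession.builder
         .config("spark.sql.avro.compression.codec", "deflate")
         .getOrCreate())

df = spark.read.format("com.databricks.spark.avro").load("/user/data/test1.avro")

# The output is still a valid .avro container; only its data blocks are
# deflate-compressed, so the same reader can load it back directly.
df.write.format("com.databricks.spark.avro").save("/user/data/test1_deflate.avro")
```

The per-writer compression option works the same way, just scoped to a single write instead of the whole session.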

Upvotes: 2
