vj sreenivasan

Reputation: 1343

Read/load Avro file from S3 using PySpark

Using an AWS Glue developer endpoint, Spark version 2.4, Python version 3.

Code:

    df = spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")

I get the following error message while trying to read the Avro file:

Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;

I found the following links, but they did not help resolve my issue:

Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html

Apache Avro as a Built-in Data Source in Apache Spark 2.4
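For reference, the deployment section of that guide shows launching with the external Avro module on the classpath. A minimal sketch of that command (the artifact version is assumed to match Spark 2.4 with Scala 2.11, the default for that release; the script name is a placeholder):

    # Deploy with the built-in-but-external Avro module, as the guide describes
    spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 your_script.py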

Upvotes: 2

Views: 4062

Answers (2)

Karthik

Reputation: 1171

Did you import the package when starting the shell? If not, you need to start the shell as shown below. The package below is applicable for Spark 2.4+.

pyspark --packages com.databricks:spark-avro_2.11:4.0.0

Also, use the full data source name inside read.format:

df = spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")

Note: For PySpark, you need to write 'com.databricks.spark.avro' instead of 'avro'.
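If you cannot pass flags to the shell (for example, in a notebook attached to the Glue development endpoint), a sketch of the same idea using the standard spark.jars.packages setting; note this only takes effect when a fresh session is built, and the session-building flow here is an assumption about your environment:

    from pyspark.sql import SparkSession

    # spark.jars.packages must be set before the driver JVM starts,
    # so it has no effect on an already-running session.
    spark = (SparkSession.builder
             .appName("avro-read")
             .config("spark.jars.packages", "com.databricks:spark-avro_2.11:4.0.0")
             .getOrCreate())

    # With the Databricks package loaded, use the full data source name
    df = spark.read.format("com.databricks.spark.avro") \
        .load("s3://dataexport/users/prod-users.avro")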

Upvotes: 1

SCouto

Reputation: 7928

You just need to include that package when launching Spark:

 org.apache.spark:spark-avro_2.11:2.4.0

Check which version you need; the spark-avro version should match your Spark and Scala versions.
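A minimal end-to-end sketch with that package (version assumed for Spark 2.4 with Scala 2.11):

    pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0

    # Inside the shell, the short name "avro" now resolves to the built-in module
    df = spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
    df.printSchema()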

Upvotes: 1
