Reputation: 610
I'm quite new to Spark. I've imported the pyspark library into my PyCharm venv and written the code below:
# Imports
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder \
.appName('DataFrame') \
.master('local[*]') \
.getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 5)
path = "file_path"
df = spark.read.format("avro").load(path)
Everything seems to be okay, but when I try to read the Avro file I get this message:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
When I go to this page: https://spark.apache.org/docs/latest/sql-data-sources-avro.html there is a deployment snippet for the Avro module, and I have no idea how to implement it. Do I download something in PyCharm, or do I have to find external files to modify?
Thank you for your help!
Update (2019-12-06): Because I'm using Anaconda, I opened the Anaconda prompt and ran this command:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
It downloaded some modules, but when I went back to PyCharm the same error appeared.
Upvotes: 1
Views: 7274
Reputation: 1
Your Spark version and your Avro JAR version should be in sync.
For example, if you're using Spark 3.1.2, your Avro JAR should be spark-avro_2.12-3.1.2.jar.
Sample Code:
spark = SparkSession.builder \
    .appName('DataFrame') \
    .config('spark.jars', 'C:\\Users\\<<User_Name>>\\Downloads\\spark-avro_2.12-3.1.2.jar') \
    .getOrCreate()
df = spark.read.format('avro').load('C:\\Users\\<<User_Name>>\\Downloads\\sample.avro')
df.show()
Output:
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
| datetime|country|region|publisher_id|placement_id| impression_id|consent| hostname| uuid|placement_type_id|iab_device_type_id|site_id|request_type|placement_type|bid_url_domain|app_bundle| tps|
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
|2021-07-30 14:55:18| null| null| 5016| 5016|8bdf2cf1-3a17-473...| 4|test.server|9515d578-9ee0-462...| 0| 5| 5016| advast| video| null| null|{5016 -> {5016, n...|
|2021-07-30 14:55:22| null| null| 2702| 2702|ab3b63d1-a916-4d7...| 4|test.server|9515d578-9ee0-462...| 1| 2| 2702| adi| banner| null| null|{2702 -> {2702, n...|
|2021-07-30 14:55:24| null| null| 1106| 1106|574f078f-0fc6-452...| 4|test.server|9515d578-9ee0-462...| 1| 2| 1106| adi| banner| null| null|{1106 -> {1106, n...|
|2021-07-30 14:55:25| null| null| 1107| 1107|54bf5cf8-3438-400...| 4|test.server|9515d578-9ee0-462...| 1| 2| 1107| adi| banner| null| null|{1107 -> {1107, n...|
|2021-07-30 14:55:27| null| null| 4277| 4277|b3508668-3ad5-4db...| 4|test.server|9515d578-9ee0-462...| 1| 2| 4277| adi| banner| null| null|{4277 -> {4277, n...|
+-------------------+-------+------+------------+------------+--------------------+-------+-----------+--------------------+-----------------+------------------+-------+------------+--------------+--------------+----------+--------------------+
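If downloading the JAR by hand is inconvenient, Spark can also pull a matching artifact from Maven at session startup via spark.jars.packages; a minimal sketch, assuming Spark 3.1.2 built against Scala 2.12 and internet access at startup:
from pyspark.sql import SparkSession

# Let Spark resolve the Avro module from Maven instead of a local JAR
spark = SparkSession.builder \
    .appName('DataFrame') \
    .config('spark.jars.packages', 'org.apache.spark:spark-avro_2.12:3.1.2') \
    .getOrCreate()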
Upvotes: 0
Reputation: 354
pyspark --jars /<path_to>/spark-avro_<version>.jar
This works for me with Spark 3.0.2.
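Once the shell starts with the JAR attached, the avro format resolves as usual; a quick check inside that pyspark session (sample.avro is a hypothetical file):
# Run inside the pyspark shell launched with --jars
df = spark.read.format('avro').load('/<path_to>/sample.avro')
df.show()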
Upvotes: 2
Reputation: 169
A simple solution is to submit the module from the Terminal tab inside PyCharm with the spark-submit command, as below.
General syntax of command:
spark-submit --packages <package_name> <script_path>
Since Avro is the package needed, the com.databricks:spark-avro_2.11:4.0.0 package should be included. So the final command will be:
spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 <script_path>
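Note that com.databricks:spark-avro_2.11:4.0.0 targets Spark versions before 2.4; since the error message says Avro became a built-in external module in Spark 2.4, the matching artifact there is org.apache.spark:spark-avro_2.11:<spark_version>. A sketch assuming Spark 2.4.4 and a hypothetical script read_avro.py:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.4 read_avro.py
where read_avro.py could be:
# read_avro.py (hypothetical script)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('AvroRead').getOrCreate()
df = spark.read.format('avro').load('<path>/userdata1.avro')
df.show()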
Upvotes: 0
Reputation: 837
I downloaded the pyspark 2.4.4 package from conda in PyCharm, added the spark-avro_2.11-2.4.4.jar file to the Spark configuration, and was able to successfully recreate your error, i.e.:
pyspark.sql.utils.AnalysisException: 'Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;'
To fix this issue, follow the steps below:
1. Download spark-2.4.4-bin-hadoop2.7.tgz from here.
2. Set SPARK_HOME to <download_path>/spark-2.4.4-bin-hadoop2.7 and set PYTHONPATH to $SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$SPARK_HOME/python.
3. Download the spark-avro_2.11-2.4.4.jar file from here.
Now you should be able to run pyspark code from PyCharm. Try the code below:
# Imports
from pyspark.sql import SparkSession

# Create SparkSession with the Avro JAR on the classpath
spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .config('spark.jars', '<path>/spark-avro_2.11-2.4.4.jar') \
    .getOrCreate()
df = spark.read.format('avro').load('<path>/userdata1.avro')
df.show()
The above code will work, but PyCharm will complain about the pyspark modules. To remove that and enable the code completion feature, follow this additional step:
1. In PyCharm's Project Structure settings, add spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip as a content root.
Now your project structure should include that content root.
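As an alternative to setting SPARK_HOME and PYTHONPATH by hand, the findspark package (an extra pip/conda install, not part of the steps above) can point the interpreter at the unpacked Spark distribution at runtime; a minimal sketch, assuming the download path from step 2:
import findspark

# Wire this interpreter to the downloaded Spark distribution before importing pyspark
findspark.init('<download_path>/spark-2.4.4-bin-hadoop2.7')

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('DataFrame') \
    .config('spark.jars', '<path>/spark-avro_2.11-2.4.4.jar') \
    .getOrCreate()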
Upvotes: 4