PasLeChoix

Reputation: 311

How to import a package from a local jar in pyspark?

I am using pyspark to do some work on a CSV file, so I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar

I downloaded the jar to my local machine due to a proxy issue.

Can anyone tell me the right way to refer to a local jar?

Here is the code I use:

pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar 

It takes me to the pyspark shell as expected; however, when I run:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')

(routes.dat is already uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)

it gives me this error:

: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat

If I run:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')

I get this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.lang.NoClassDefFoundError: Could not initialize class com.databricks.spark.csv.package$

Can anyone help me sort this out? Thank you very much; any clue is appreciated.

Upvotes: 0

Views: 1077

Answers (1)

Gaurav Dhama

Reputation: 1336

The correct way to do this would be to add the options when you launch the shell (say, if you are starting a spark shell):

spark-shell --packages com.databricks:spark-csv_2.11:1.4.0 --driver-class-path /path/to/csvfilejar.jar
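Since the question uses pyspark rather than spark-shell, the equivalent invocation (a sketch, assuming network access to Maven Central so --packages can resolve the artifact and its transitive dependencies) would be:

pyspark --packages com.databricks:spark-csv_2.11:1.4.0

The NoClassDefFoundError: org/apache/commons/csv/CSVFormat in the question is the usual symptom of passing only the spark-csv jar with --jars: that flag ships the single jar but does not pull in its dependencies, and spark-csv needs commons-csv at runtime. If the proxy rules out --packages, downloading the dependency jars as well and listing them all should also work (a sketch; the commons-csv and univocity-parsers jar names and paths are assumptions based on spark-csv 1.4.0's declared dependencies):

pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar,/path/to/commons-csv-1.1.jar,/path/to/univocity-parsers-1.5.1.jar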

I have not used the Databricks CSV jar directly, but I used a Netezza connector for Spark where they mention using this option:

https://github.com/SparkTC/spark-netezza
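With the package and its dependencies on the classpath, the load from the question should then work unchanged; a quick sanity check in the pyspark shell (reusing the HDFS path from the question):

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load('hdfs://dev-icg/user/spark/routes.dat')
df.printSchema()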

Upvotes: 0
