How to import a package from a local jar in pyspark?

Question

I am using pyspark to process a CSV file, so I need to import the package from spark-csv_2.10-1.4.0.jar, downloaded from https://repo1.maven.org/maven2/com/databricks/spark-csv_2.11/1.4.0/spark-csv_2.11-1.4.0.jar

I downloaded the jar to my local machine because of a proxy issue.

Can anyone tell me the right way to refer to a local jar?

Here is the command I use:

pyspark --jars /home/rx52019/data/spark-csv_2.10-1.4.0.jar

It takes me to the pyspark shell as expected. However, when I run:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('hdfs://dev-icg/user/spark/routes.dat')

(routes.dat has already been uploaded to HDFS at hdfs://dev-icg/user/spark/routes.dat)

I get this error:

: java.lang.NoClassDefFoundError: org/apache/commons/csv/CSVFormat

If I run:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true',inferschema='true').load('routes.dat')

I get this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.lang.NoClassDefFoundError: Could not initialize class com.databricks.spark.csv.package$

Can anyone help me sort this out? Thank you very much. Any clue is appreciated.


Answers (1)
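
The NoClassDefFoundError for org/apache/commons/csv/CSVFormat suggests that the spark-csv jar itself was found, but a library it depends on (Apache Commons CSV) was not. Unlike --packages, which resolves dependencies from Maven and so needs network access, --jars ships only the files you list, as one comma-separated value. A minimal sketch of building that command is below. It assumes you have also downloaded the Commons CSV jar by hand; the file name commons-csv.jar and the helper build_pyspark_command are hypothetical, and any other dependency the package declares would have to be added the same way.

```python
import shlex


def build_pyspark_command(jar_paths):
    """Return a pyspark launch command that ships every jar in jar_paths."""
    if not jar_paths:
        raise ValueError("need at least one jar path")
    # --jars takes a single comma-separated value with no spaces between entries.
    jars_arg = ",".join(jar_paths)
    # Quote the value so unusual characters in a path survive the shell.
    return "pyspark --jars " + shlex.quote(jars_arg)


jars = [
    "/home/rx52019/data/spark-csv_2.10-1.4.0.jar",  # path from the question
    "/home/rx52019/data/commons-csv.jar",           # hypothetical: dependency downloaded by hand
]
print(build_pyspark_command(jars))
```

Run the printed command in place of the original one, then retry the sqlContext.read call from the question. The jar's Scala suffix (_2.10 or _2.11) should also match the Scala version your Spark build uses.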
