Reputation: 1819
I have a Spark cluster that I use in local mode. I want to read a CSV file with the Databricks external library spark-csv. I start my app as follows:
import os
import sys

# Point PySpark at the local Spark installation before importing it
os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Reuse an existing SparkContext if one is already defined, otherwise create one
try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc = SparkContext()

sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load("/my/path/to/my/file.csv")
When I run it, I get the following error:
java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
My question: how can I load the databricks spark-csv library from INSIDE my Python code? I don't want to load it from outside, for instance with --packages.
I tried adding the following line, but it did not work:
os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'
Upvotes: 2
Views: 3074
Reputation: 330263
If you create the SparkContext from scratch you can, for example, set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)
sc = SparkContext()
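Once the package is on the driver classpath, the read from your question should work unchanged:

sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .load("/my/path/to/my/file.csv")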
If for some reason you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and URLClassLoader, but it doesn't look like a good idea and it won't work in cluster mode.
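For completeness, a minimal sketch of that Py4J/URLClassLoader route, assuming an already-running local-mode SparkContext (the jar path is just an illustration; none of this helps on a real cluster):

# Sketch only, local mode: make an extra jar visible to the driver JVM via Py4J.
# The jar location below is an assumption; point it at your spark-csv jar.
jvm = sc._jvm
url_array = sc._gateway.new_array(jvm.java.net.URL, 1)
url_array[0] = jvm.java.net.URL(
    "file:/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar")

# Wrap the current context class loader in a URLClassLoader that also sees the jar
loader = jvm.java.net.URLClassLoader(
    url_array, jvm.java.lang.Thread.currentThread().getContextClassLoader())
jvm.java.lang.Thread.currentThread().setContextClassLoader(loader)

Even then, classes Spark has already loaded won't see the new loader, which is part of why this approach is fragile.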
Upvotes: 2