Romain Jouin

Reputation: 4848

spark 2.0 - java.io.IOException: Cannot run program "jupyter": error=2, No such file or directory

I am using jupyter notebooks to try spark.

Once in my notebook, I try a KMeans:

from pyspark.sql           import SparkSession
from pyspark.ml.clustering import KMeans
from sklearn               import datasets
import pandas as pd

spark = SparkSession\
        .builder\
        .appName("PythonKMeansExample")\
        .getOrCreate()

iris       = datasets.load_iris()
pd_df      = pd.DataFrame(iris['data'])
spark_df   = spark.createDataFrame(pd_df, ["features"])
estimator  = KMeans(k=3, seed=1)

Everything goes fine; then I fit the model:

estimator.fit(spark_df)

And I get an error:

16/08/16 22:39:58 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 24)
java.io.IOException: Cannot run program "jupyter": error=2, No such file or directory

Caused by: java.io.IOException: error=2, No such file or directory

Where is Spark looking for Jupyter? Why can't it find it, if I can use jupyter notebook? What should I do?

Upvotes: 5

Views: 11383

Answers (1)

fandyst

Reputation: 2830

As the code at https://github.com/apache/spark/blob/master/python/pyspark/context.py#L180 shows:

self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python')

so I think this error is caused by the environment variable PYSPARK_PYTHON, which indicates the Python location for each Spark node. When pyspark starts, PYSPARK_PYTHON is read from the system environment and injected into all Spark nodes, so:

  1. it can be solved by

    export PYSPARK_PYTHON=/usr/bin/python
    

    pointing at the same Python version on the different nodes, and then starting:

    pyspark
    
  2. if the Python versions differ between your local machine and the cluster nodes, a version-conflict error will occur instead.

  3. the interactive Python you work in should be the same version as the one on the other nodes in the cluster.
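To illustrate the lookup described above: the executors resolve their Python interpreter from `PYSPARK_PYTHON`, falling back to plain `python` when the variable is unset. A minimal sketch (the `/usr/bin/python` path is an assumption; substitute the interpreter actually installed on your nodes):

```python
import os

# Set this before building the SparkSession so the executors inherit it.
# "/usr/bin/python" is an assumed path; use your cluster's interpreter.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"

# This mirrors the lookup in pyspark/context.py: use PYSPARK_PYTHON if
# set, otherwise fall back to whatever "python" resolves to on the node.
python_exec = os.environ.get("PYSPARK_PYTHON", "python")
print(python_exec)  # -> /usr/bin/python
```

If the variable is unset and `python` on a worker resolves to `jupyter` (or nothing at all), you get exactly the "Cannot run program" error from the question.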

Upvotes: 5
