user5576922

Reputation: 101

Setting up environment

I am using Google Colaboratory to learn about PySpark. For some reason, I get an error when running the environment setup code. It seems to happen when I move from one notebook to another.

Error message:

IndexError                                Traceback (most recent call last)
<ipython-input-1-047bb5b2397e> in <module>()
      4 
      5 import findspark
----> 6 findspark.init()
      7 from pyspark import SparkContext
      8 sc = SparkContext.getOrCreate()

/usr/local/lib/python3.6/dist-packages/findspark.py in init(spark_home, python_path, edit_rc, edit_profile)
    133     # add pyspark to sys.path
    134     spark_python = os.path.join(spark_home, 'python')
--> 135     py4j = glob(os.path.join(spark_python, 'lib', 'py4j-*.zip'))[0]
    136     sys.path[:0] = [spark_python, py4j]
    137 

IndexError: list index out of range

The code provided by Google for setting up the environment:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate() 
spark
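From the traceback, findspark.init() fails while globbing for the py4j zip under SPARK_HOME/python/lib and then indexing an empty list. A quick check (a minimal sketch reusing the same path as above) shows whether that directory exists at all:

import os, glob
spark_home = "/content/spark-2.3.1-bin-hadoop2.7"
# If Spark was never downloaded to this backend, the directory is missing...
print(os.path.isdir(spark_home))
# ...and the glob that findspark indexes returns [], hence the IndexError on [0]
print(glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*.zip')))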

Upvotes: 0

Views: 473

Answers (1)

Bob Smith

Reputation: 38589

You'll need to install Spark before running the snippet above: findspark.init() raises the IndexError because SPARK_HOME points at a directory that doesn't exist until Spark has actually been downloaded and unpacked there. Here's an adjusted recipe that works for me with a fresh Colab backend:

# Install Java 8, download and unpack Spark 2.3.2, and install findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.3.2/spark-2.3.2-bin-hadoop2.7.tgz
!tar xf spark-2.3.2-bin-hadoop2.7.tgz
!pip install -q findspark

# Point findspark at the unpacked Spark directory, then start a context
import findspark
findspark.init('/content/spark-2.3.2-bin-hadoop2.7/')
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Build (or reuse) a SparkSession on top of the running context
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark
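If the install succeeded, a quick sanity check (a hypothetical example; any small DataFrame will do) should run without errors:

# Hypothetical smoke test: build a tiny DataFrame and display it
df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
df.show()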

Upvotes: 1
