Reputation: 489
I am (very) new to AWS and Spark in general, and I'm trying to run a notebook instance in Amazon EMR. When I try to import pyspark to start a session and load data from S3, I get the error No module named 'pyspark'. The cluster I created has the Spark option selected, so what am I doing wrong?
Upvotes: 4
Views: 7055
Reputation: 1
You could try using the findspark library: pip install findspark, then run the code below in your Jupyter notebook.
import findspark
findspark.init()  # adds the Spark installation to sys.path so pyspark can be imported
%load_ext sparksql_magic  # enables the %%sparksql cell magic
%config SparkSql.limit=200  # caps the number of rows displayed per query
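Once findspark.init() has run, importing pyspark and creating a session should work as usual. A minimal sketch (the S3 path and app name are just placeholders):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("emr-notebook").getOrCreate()  # start or reuse a session
df = spark.read.csv("s3://your-bucket/your-data.csv", header=True)  # hypothetical S3 path
df.show(5)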
Upvotes: 0
Reputation: 15233
You can open a JupyterLab notebook and select a new Spark notebook there; this initializes the Spark context for you automatically.
Or you can open a Jupyter notebook and start the Spark application with the %%spark magic.
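As a rough illustration (assuming the Sparkmagic kernel/extension is set up in the notebook), a %%spark cell runs its body on the cluster's Spark session, so pyspark does not need to be importable in the local kernel:
%%spark
df = spark.read.json("s3://your-bucket/data.json")  # hypothetical path; spark is provided by the kernel
df.printSchema()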
Upvotes: 1
Reputation: 489
The only solution that worked for me was to change the notebook kernel to the PySpark kernel and then change the bootstrap action to install the packages (under Python 3.6) that are not included by default in the PySpark kernel:
#!/bin/bash
# EMR bootstrap action: install extra packages into the Python 3.6 environment
# used by the PySpark kernel (a plain "pip install" would target Python 2.7 instead)
sudo python3.6 -m pip install numpy \
    matplotlib \
    pandas \
    seaborn \
    pyspark
Apparently pip installs to Python 2.7.16 by default, so the bootstrap action reports no error, but you can't import the modules in the notebook because the PySpark environment uses Python 3.6.
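To confirm which interpreter the notebook is actually running and that the bootstrapped packages landed there, a quick check in a cell helps (a minimal sketch):
import sys
print(sys.version)             # should report Python 3.6.x rather than 2.7.16
import numpy, pandas, seaborn  # should now import without errors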
Upvotes: 4