RafaJM

Reputation: 489

No module named 'pyspark' when running Jupyter notebook inside EMR

I am (very) new to AWS and Spark in general, and I'm trying to run a notebook instance in Amazon EMR. When I try to import pyspark to start a session and load data from s3, I get the error No module named 'pyspark'. The cluster I created had the Spark option filled, what am I doing wrong?
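One quick diagnostic for this kind of import failure (a sketch added here, not part of the original question) is to check, from the failing notebook cell itself, which interpreter the kernel is running and whether pyspark is visible to it — on EMR the error usually means the kernel's Python is not the one Spark was installed for:

```python
# Diagnostic sketch: run in the notebook cell that raises the import error.
import sys
import importlib.util

print(sys.executable)            # path of the interpreter behind this kernel
print(sys.version_info[:2])      # major/minor Python version

# find_spec returns None when a package is not importable from this environment
print("pyspark importable:", importlib.util.find_spec("pyspark") is not None)
```

If this prints False, the kernel's interpreter simply has no pyspark package, whatever the cluster configuration says.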

Upvotes: 4

Views: 7055

Answers (3)

Anoop George c

Reputation: 1

You could try the findspark library: pip install findspark, then run the code below in your Jupyter notebook.

import findspark
findspark.init()

%load_ext sparksql_magic
%config SparkSql.limit=200

Upvotes: 0

naren

Reputation: 15233

You can open a JupyterLab notebook and select a new Spark notebook from there. This will initialize the Spark context for you automatically.


Or you can open a Jupyter notebook and load the Spark app with the %%spark magic.


Upvotes: 1

RafaJM

Reputation: 489

The only solution that worked for me was to change the notebook kernel to the PySpark kernel, and then modify the bootstrap action to install the packages (under Python 3.6) that the PySpark kernel does not ship with by default:

#!/bin/bash
sudo python3.6 -m pip install numpy \
    matplotlib \
    pandas \
    seaborn \
    pyspark

Apparently pip installs to Python 2.7.16 by default, so the install finishes with no error message, but you can't import the modules because the PySpark kernel runs Python 3.6. Using python3.6 -m pip explicitly, as above, makes the packages land in the right interpreter.
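The mismatch described above suggests a defensive pattern (a sketch under my own assumptions, not part of the original bootstrap script): install through sys.executable -m pip from inside the kernel, so the packages can only land in the interpreter the notebook actually runs:

```python
# Sketch: install a package into the interpreter running this kernel,
# rather than whatever 'pip' happens to resolve to on PATH (which on EMR
# may belong to Python 2.7).
import sys
import subprocess

def install(package):
    # sys.executable is the running interpreter; '-m pip' guarantees the
    # install targets that interpreter's site-packages.
    return subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        check=False,
    ).returncode

# install("numpy")  # uncomment inside the notebook to actually install
print(sys.executable)
```

This sidesteps the silent 2.7-vs-3.6 split entirely, since there is no ambiguity about which Python receives the packages.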

Upvotes: 4
