How to use Spark with Python or Jupyter Notebook

I am trying to work with 12GB of data in Python, for which I desperately need to use Spark, but I guess I'm too stupid to use the command line, either by myself or with help from the internet, and that is why I have to turn to SO.

So far I have downloaded Spark and unzipped the tar file (sorry for the language, but I am feeling stupid and out of my depth), but now I can't see where to go next. I have read the instructions in the Spark documentation, which say:

"Spark also provides a Python API. To run Spark interactively in a Python interpreter, use bin/pyspark"

But where do I do this? Please help. Edit: I am using Windows 10.

Note: I have always had problems when trying to install something, mainly because I can't seem to understand the Command Prompt.

Upvotes: 3

Answers (3)

VenVig

Reputation: 915

I understand that you have already installed Spark on Windows 10.

You will need to have winutils.exe available as well. If you haven't already done so, download the file from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it in, say, C:\winutils\bin

Set up the environment variables:

HADOOP_HOME=C:\winutils
SPARK_HOME=C:\spark (or wherever you unpacked Spark)
PYSPARK_DRIVER_PYTHON=ipython (or jupyter)
PYSPARK_DRIVER_PYTHON_OPTS=notebook
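
As a quick sanity check, the settings above can be verified from Python before launching pyspark. This is only a minimal sketch and not part of Spark: the function name check_spark_env is hypothetical, and it assumes winutils.exe sits in the bin folder under HADOOP_HOME, as in the steps above.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-related environment variables.

    `env` defaults to os.environ; passing a plain dict makes the check easy to try out.
    """
    env = os.environ if env is None else env
    problems = []

    # Both variables must be set and must point at existing directories.
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")

    # Assumption taken from the steps above: winutils.exe lives in HADOOP_HOME/bin.
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home and os.path.isdir(hadoop_home):
        winutils = os.path.join(hadoop_home, "bin", "winutils.exe")
        if not os.path.isfile(winutils):
            problems.append(f"winutils.exe not found at {winutils}")

    return problems


if __name__ == "__main__":
    # Print every problem, or a short confirmation if the list is empty.
    for line in check_spark_env() or ["Environment looks fine"]:
        print(line)
```

An empty list means both directories and winutils.exe were found. Note that a Command Prompt opened before the variables were set will not see them, so open a new one first.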

Now navigate to the C:\Spark directory in a command prompt and type "bin\pyspark" (plain "pyspark" also works if Spark's bin directory is on your PATH).

Jupyter Notebook will launch in a browser. Create a Spark context and run a count command to confirm that everything works.
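
A minimal sketch of what such a first notebook cell might look like. It assumes pyspark is importable in the notebook (which should be the case when the notebook was started through the pyspark launcher). The helper names get_context and count_numbers are hypothetical, and the master setting local[*] simply runs Spark on all local cores.

```python
def get_context(app_name="count-demo"):
    """Return the existing SparkContext, or create a local one if none exists."""
    # Imported lazily so that count_numbers below can be read and tried without Spark.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster("local[*]")
    return SparkContext.getOrCreate(conf)


def count_numbers(sc, n=1000):
    """Distribute the integers 0..n-1 as an RDD and count them."""
    rdd = sc.parallelize(range(n))
    return rdd.count()
```

In a notebook started through pyspark, a context named sc may already exist, in which case count_numbers(sc) can be called directly. Otherwise, count_numbers(get_context()) creates a local context first.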

Upvotes: 0

pauli

Reputation: 4301

If you are more familiar with Jupyter Notebook, you can install Apache Toree, which provides PySpark, Scala, SQL and SparkR kernels that connect Jupyter to Spark.

To install Toree:

pip install toree
jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark

If you want to install other kernels, you can use:

jupyter toree install --interpreters=SparkR,SQL,Scala

Now run:

jupyter notebook

In the UI, when creating a new notebook, you should see the following kernels available:

Apache Toree-Pyspark
Apache Toree-SparkR
Apache Toree-SQL
Apache Toree-Scala

Upvotes: 3

gsamaras

Reputation: 73376

When you unzip the file, a directory is created.

Open a terminal.
Navigate to that directory with cd.
Do an ls (dir in the Windows Command Prompt). You will see its contents; a bin directory should be among them.
Execute bin/pyspark or maybe ./bin/pyspark (bin\pyspark on Windows).

Of course, in practice it's not always that simple. You may need to set some paths, as described on TutorialsPoint, and there are plenty of similar guides out there.
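
Because the question is about Windows, where the launcher is bin\pyspark.cmd rather than bin/pyspark, the steps above can be sketched as a small Python helper that finds the right script for the current platform. The function name pyspark_launcher is hypothetical, and the sketch only locates the script without running it.

```python
import os
import sys


def pyspark_launcher(spark_dir, platform=sys.platform):
    """Return the path of the pyspark launcher inside an unpacked Spark directory.

    Spark ships bin/pyspark for Unix-like shells and bin/pyspark.cmd for Windows.
    Raises FileNotFoundError if the expected script is missing.
    """
    script = "pyspark.cmd" if platform.startswith("win") else "pyspark"
    launcher = os.path.join(spark_dir, "bin", script)
    if not os.path.isfile(launcher):
        raise FileNotFoundError(
            f"No launcher at {launcher}; is {spark_dir} the Spark directory?"
        )
    return launcher
```

Printing the result for the directory created by unzipping gives the exact path to type into the terminal.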

Upvotes: 1
