Ferrard

Reputation: 2538

Can PySpark work without Spark?

I have installed PySpark standalone/locally (on Windows) using

pip install pyspark

I was a bit surprised that I can already run pyspark on the command line or use it in Jupyter Notebooks, and that it does not need a proper Spark installation (e.g. I did not have to do most of the steps in this tutorial: https://medium.com/@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c ).
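For instance, a minimal session like the one below runs for me with nothing beyond the pip-installed package (plus a working Java runtime, which PySpark still requires). The data is just a throwaway example:

from pyspark.sql import SparkSession

# Start a local Spark session; no separate Spark installation was configured.
spark = SparkSession.builder.master("local[*]").appName("pip-only-test").getOrCreate()

# A throwaway DataFrame just to confirm jobs actually execute.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()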

Most of the tutorials that I run into say one needs to "install Spark before installing PySpark". That would agree with my view of PySpark being basically a wrapper over Spark. But maybe I am wrong here - can someone explain how the pip-installed PySpark relates to a full Spark installation?

Upvotes: 49

Views: 22102

Answers (3)

qwr

Reputation: 10891

PySpark includes its own Spark installation. If installed through pip3, you can find it with pip3 show pyspark. For example, on my machine it is at ~/.local/lib/python3.8/site-packages/pyspark.

This is a standalone configuration, so it can't be used for managing clusters the way a full Spark installation can.
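As a quick check, the same location can also be printed from Python itself (a minimal sketch, assuming the pip-installed pyspark package):

import os
import pyspark

# The pip package ships its own Spark distribution; this prints where it lives.
print(os.path.dirname(pyspark.__file__))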

Upvotes: 4

Livmortis

Reputation: 151

The PySpark installed by pip is a subfolder of the full Spark distribution. You can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from Apache Spark and install it.

Upvotes: 8

Kirk Broadhurst

Reputation: 28698

As of v2.2, executing pip install pyspark will install Spark.

If you're going to use PySpark, it's clearly the simplest way to get started.

On my system, Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars.
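A small sketch of verifying this yourself, listing a few of the JARs bundled with the pip package (the path is whatever pyspark resolves to in your environment):

import os
import pyspark

# The Spark JARs bundled with the pip package live under pyspark/jars.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(sorted(os.listdir(jars_dir))[:5])  # show a few of the bundled JARs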

Upvotes: 43
