Reputation: 479
I am currently teaching myself Spark and trying to rewrite an existing Python application in PySpark. However, I am still confused about how regular Python objects are used in PySpark.
I understand Spark's distributed data structures, such as the RDD, DataFrame, Dataset, vectors, etc., and that Spark has its own transformations and actions, such as .map() and .reduceByKey(), for manipulating them. But what if I create traditional Python data objects such as an array, list, tuple, or dictionary in PySpark? They will be stored only in the memory of my driver node, right? And if I transform them into an RDD, can I still operate on them with ordinary Python functions?
If I have a huge dataset, can I use regular Python libraries like pandas or numpy to process it in PySpark? Will Spark use only the driver node if I call a Python function directly on a Python object, or do I have to put the data into an RDD and use Spark's operations?
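For example, I am not sure which parts of something like this run only on the driver and which are distributed (assuming a standard local SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("question-example").getOrCreate()
sc = spark.sparkContext

# A plain Python list: I assume this lives only in the driver's memory.
numbers = [1, 2, 3, 4, 5]
squared_locally = [n * n for n in numbers]   # presumably computed on the driver only?

# The same data turned into an RDD: can I still use a normal Python function here,
# and does it then run across the cluster?
def square(n):
    return n * n

rdd = sc.parallelize(numbers)
print(rdd.map(square).collect())
```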
Upvotes: 1
Views: 1081
Reputation: 5480
You can create traditional Python data objects such as arrays, lists, tuples, or dictionaries in PySpark, but they live only in the memory of the driver process.
You can still use ordinary Python functions: pass them to Spark transformations such as map() or filter(), and Spark will run them on the executors against each element of the RDD.
You can import Python libraries such as numpy or pandas and use them to process data inside Spark tasks (for example, once per partition); if you call them directly on a plain Python object instead, that work happens only on the driver.
For a large dataset, you should create an RDD (or DataFrame) and apply Spark's operations so the processing is distributed across the cluster; see the sketch below.
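A minimal sketch of what I mean, assuming a standard SparkSession (the function names add_offset and partition_mean are just examples I made up):

```python
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("python-objects-demo").getOrCreate()
sc = spark.sparkContext

# 1. Plain Python objects: these exist only in the driver's memory.
numbers = list(range(10))
lookup = {"offset": 100}

# 2. Distribute the list as an RDD and apply an ordinary Python function
#    through a Spark transformation; the function runs on the executors.
def add_offset(n):
    return n + lookup["offset"]   # lookup is pickled and shipped to the executors

rdd = sc.parallelize(numbers)
shifted = rdd.map(add_offset)

# 3. Use a regular Python library (numpy here) inside a Spark task,
#    e.g. once per partition, so the heavy work stays distributed.
def partition_mean(iterator):
    arr = np.fromiter(iterator, dtype=float)
    yield float(arr.mean()) if arr.size else 0.0

means = rdd.mapPartitions(partition_mean)

# 4. Actions bring the (small) results back to the driver.
print(shifted.collect())
print(means.collect())

spark.stop()
```

If you call numpy or pandas directly on a plain Python object (e.g. np.array(numbers).mean()), that computation happens only on the driver, so for a really large dataset keep the heavy work inside RDD/DataFrame operations as above (or look at pandas UDFs on DataFrames).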
Upvotes: 0