user2205916

Reputation: 3456

Why is .show() on a 20 row PySpark dataframe so slow?

I am using PySpark in a Jupyter notebook. The following step takes up to 100 seconds, which is OK.

toydf = df.select("column_A").limit(20)

However, the following show() step takes 2-3 minutes. It only has 20 rows of lists of integers, and each list has no more than 60 elements. Why does it take so long?

toydf.show()

df is generated as follows:

spark = SparkSession.builder\
    .config(conf=conf)\
    .enableHiveSupport()\
    .getOrCreate()
df = spark.sql("""SELECT column_A
                        FROM datascience.email_aac1_pid_enl_pid_1702""")

Upvotes: 8

Views: 8552

Answers (1)

code.gsoni

Reputation: 695

In Spark there are two major concepts:

1: Transformations: whenever you apply withColumn, drop, join, or groupBy, nothing is actually evaluated; these operations just produce a new DataFrame or RDD that describes the computation.

2: Actions: operations like count, show, display, and write are the ones that actually do the work of the transformations. Internally, each action calls Spark's runJob API to execute all the pending transformations as a job.

So in your case, when you run toydf = df.select("column_A").limit(20), nothing happens.

But show() is an action, so only at that point does Spark actually evaluate df.select("column_A").limit(20), run the query on the cluster, and collect the resulting rows to your driver node. That is why the time shows up at show() rather than at the line that defines toydf.
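To make the lazy-vs-eager distinction concrete, here is a plain-Python analogy (not Spark itself) using generators: building the pipeline does no work, and evaluation only happens when something consumes it, just as show() triggers the earlier select/limit.

```python
import itertools

# Plain-Python analogy: "transformations" build a lazy pipeline,
# and nothing runs until an "action" consumes it.
evaluated = []

def source():
    for i in range(1_000_000):
        evaluated.append(i)  # record that work actually happened
        yield i

# "Transformation": defining the pipeline does no work yet,
# like df.select("column_A").limit(20).
pipeline = (x * 2 for x in source())
assert evaluated == []  # nothing has been evaluated so far

# "Action": consuming the first 20 items finally triggers evaluation,
# like show() does -- and only as much work as needed.
first_20 = list(itertools.islice(pipeline, 20))
assert len(evaluated) == 20  # only 20 source rows were materialized
```

The same idea explains why the definition of toydf returns quickly while show() appears slow: all of the real query work is deferred to the action.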

Upvotes: 2
