Reputation: 3456
I am using PySpark in a Jupyter notebook. The following step takes up to 100 seconds, which is OK.
toydf = df.select("column_A").limit(20)
However, the following show() step takes 2-3 minutes. The result has only 20 rows of lists of integers, and each list has no more than 60 elements. Why does it take so long?
toydf.show()
df
is generated as follows:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
.config(conf=conf)\
.enableHiveSupport()\
.getOrCreate()
df = spark.sql("""SELECT column_A
FROM datascience.email_aac1_pid_enl_pid_1702""")
Upvotes: 8
Views: 8552
Reputation: 695
In Spark there are two major concepts:
1: Transformations: whenever you apply withColumn, drop, join, or groupBy, nothing is actually evaluated; they just produce a new DataFrame or RDD.
2: Actions: in contrast, actions like count, show, display, or write actually do all the work of the transformations, and all actions internally call Spark's runJob API to run the transformations as a job (see the sketch below).
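As a quick sketch of that laziness (timings are illustrative only, reusing the df from the question): defining the transformation returns almost instantly, while the action triggers the actual job.

import time

start = time.time()
toydf = df.select("column_A").limit(20)  # transformation: only builds a query plan
print("define: %.2fs" % (time.time() - start))  # returns almost immediately

start = time.time()
toydf.show()  # action: Spark now runs the whole job on the cluster
print("show: %.2fs" % (time.time() - start))  # this is where the minutes go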
And in your case, when you run toydf = df.select("column_A").limit(20), nothing is happening. But when you call show(), which is an action, Spark collects data from the cluster to your driver node, and only at that point does it actually evaluate toydf = df.select("column_A").limit(20).
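You can see what show() will have to execute, without actually running the job, by printing the physical plan (a sketch, assuming the same toydf as above):

toydf.explain()  # prints the physical plan; no job runs
# The plan still contains the full scan of the Hive table, which is
# why show() pays the cost that select().limit() only deferred.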
Upvotes: 2