user2205916

Reputation: 3456

Why is .show() on a 20 row PySpark dataframe so slow?

I am using PySpark in a Jupyter notebook. The following step takes up to 100 seconds, which is OK.

toydf = df.select("column_A").limit(20)

However, the following show() step takes 2-3 minutes. It only has 20 rows of lists of integers, and each list has no more than 60 elements. Why does it take so long?

toydf.show()

df is generated as follows:

spark = SparkSession.builder\
    .config(conf=conf)\
    .enableHiveSupport()\
    .getOrCreate()
df = spark.sql("""SELECT column_A
                        FROM datascience.email_aac1_pid_enl_pid_1702""")

Upvotes: 8

Views: 8552

Answers (1)

code.gsoni

Reputation: 695

In Spark there are two major concepts:

1: Transformations: whenever you apply withColumn, drop, join, or groupBy, nothing is actually evaluated; these operations just produce a new DataFrame or RDD that describes the computation.

2: Actions: operations like count, show, display, and write are the ones that actually do the work of the transformations. Internally, each action calls Spark's runJob API to execute all the pending transformations as a job.

So in your case, when you run toydf = df.select("column_A").limit(20), nothing happens.

But show() is an action, so only at that point does Spark actually evaluate df.select("column_A").limit(20), run the query on the cluster, and collect the resulting rows to your driver node. That is why the time shows up at show() rather than at the line that defines toydf.
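To make the lazy-vs-eager distinction concrete, here is a plain-Python analogy (not Spark itself) using generators: building the pipeline does no work, and evaluation only happens when something consumes it, just as show() triggers the earlier select/limit.

```python
import itertools

# Plain-Python analogy: "transformations" build a lazy pipeline,
# and nothing runs until an "action" consumes it.
evaluated = []

def source():
    for i in range(1_000_000):
        evaluated.append(i)  # record that work actually happened
        yield i

# "Transformation": defining the pipeline does no work yet,
# like df.select("column_A").limit(20).
pipeline = (x * 2 for x in source())
assert evaluated == []  # nothing has been evaluated so far

# "Action": consuming the first 20 items finally triggers evaluation,
# like show() does -- and only as much work as needed.
first_20 = list(itertools.islice(pipeline, 20))
assert len(evaluated) == 20  # only 20 source rows were materialized
```

The same idea explains why the definition of toydf returns quickly while show() appears slow: all of the real query work is deferred to the action.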

Upvotes: 2
