eTothEipiPlus1

Reputation: 587

Small Spark dataframe very slow in Databricks

I'm using a pretty standard Databricks cluster (2 nodes with 14 GB memory, 4 cores, 0.75 DBU). I have a function spark_shape defined as

def spark_shape(df):
    """Return (rows, columns) of a Spark DataFrame."""
    return (df.count(), len(df.columns))

And a simple dataframe df that is only of shape (590, 2). However, running spark_shape(df) takes over 6 minutes! I'm wondering if I need to increase the memory or the number of nodes in the Databricks cluster, but this dataframe is so small that I don't understand why a simple operation would take this long. Any ideas?

Upvotes: 1

Views: 4800

Answers (2)

João Barboza

Reputation: 19

This function runs into a subtle issue.

You can't know in advance how long Spark will take to evaluate the dataframe.

Spark does not process data as soon as each line of code runs. With lazy evaluation, transformations like join, merge, filter, and where only define the plan Spark will need to execute to produce the desired result; they are not applied when the code is executed. The work is only done when some action is invoked (count, save, write, display).
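
For example, a minimal sketch (the column amount, the second dataframe other_df, and the key id are hypothetical):

from pyspark.sql import functions as F

# Transformations are lazy: these lines only build a query plan.
filtered = df.filter(F.col("amount") > 0)
joined = filtered.join(other_df, on="id")

# Only an action forces Spark to execute the plan built above.
joined.count()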

Those factors explain why this function takes so much time. The dataframe is actually computed inside it, because it was never materialized before. You can force the execution earlier by saving the df, applying a checkpoint, or using persist (followed by some action, because persist and cache are also lazy and only take effect once an action runs).

That way the df is already materialized, and only the shape calculation is executed inside this function.
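
A rough sketch of what that could look like (the checkpoint directory is a placeholder; set it to something appropriate for your workspace):

# persist is lazy too, so follow it with an action to materialize df
df = df.persist()
df.count()

# or checkpoint, which truncates the lineage entirely
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder path
df = df.checkpoint()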

Upvotes: -1

eTothEipiPlus1

Reputation: 587

Review Spark's query plan to understand how Spark will read and process the data.

df.explain()

If you have done a lot of preprocessing on df, the query plan can get ugly fast. My solution is to save the preprocessed df to a new (parquet) file (kind of like a staging table) and then read that file back in as a PySpark DataFrame to do any further operations on the data. This may not be the only or best solution, but it's all I've found that works so far.
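
Something along these lines (the path is a placeholder):

# writing out the preprocessed df executes the full plan once
df.write.mode("overwrite").parquet("/tmp/staging/df_preprocessed")  # placeholder path

# reading it back gives a dataframe with a trivial query plan, so
# actions like count() no longer re-run all the preprocessing
df = spark.read.parquet("/tmp/staging/df_preprocessed")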

Upvotes: 1
