MagicHiggs

Reputation: 49

Why is there always a .collect() after a .distinct()?

I'm a beginner with Spark. I often see the .distinct().collect() pattern. What is the underlying reason for calling collect() right after distinct()?

Upvotes: 1

Views: 2302

Answers (3)

Piyush Patel

Reputation: 1751

There could be several reasons for this.

collect is an action. Spark evaluates lazily, so when you call collect it executes all the operations leading up to that call (starting from the beginning of the lineage, unless an intermediate result was cached) and returns the result. A likely reason you saw distinct before collect is that collect returns the results to the driver program: selecting only the distinct values first limits the result, so fewer records have to be fetched and the driver is less likely to hit OutOfMemory errors.

There is no other reason why you would see these two methods together. Please note that it is largely a coincidence that you saw them together. I have never used them together in any of my projects.

Upvotes: 0

Zac Roberts

Reputation: 141

Spark relies on the concept of 'lazy evaluation'. Lazy evaluation means that Spark waits until the last possible moment to execute a graph of computation instructions, generally so that it can look for ways to optimize the execution plan. Lazy evaluation involves two kinds of operations: transformations and actions. A transformation (such as distinct(), sort(), or filter()) is only noted by Spark and built into a logical plan. This plan is represented as a DAG (Directed Acyclic Graph). We need to call an action to get Spark to execute the DAG. Examples of actions include count(), show(), and collect(). Actions are basically anything that brings the result of our data transformations back to a native object in the respective language, in this case Python.

In your example, Spark doesn't actually execute the DAG when you call distinct(). It executes the DAG when you call an action after distinct(), such as distinct().collect(), distinct().show(), or distinct().count(). Also, collect() is simply a function that returns the contents of the DataFrame as a Python list of Row objects, as described in the PySpark documentation for collect(). You can follow distinct() with other actions; collect() is just a frequently used example in tutorials because it shows the structure of the Row objects in the dataset.
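To make the lazy-evaluation point concrete, here is a toy model in plain Python. It is not how Spark is implemented and it does not use the PySpark API: the class `LazyDataset` and its `executed` log are hypothetical names made up for this illustration. The sketch only shows that distinct() records a step, while collect() is what triggers the work.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset (hypothetical, not the PySpark API)."""

    def __init__(self, rows, plan=(), log=None):
        self._rows = list(rows)
        self._plan = tuple(plan)  # recorded transformations, not yet run
        # Names of the steps that have actually run (shared along the chain).
        self.executed = log if log is not None else []

    def distinct(self):
        # Transformation: only extend the plan and return a new dataset.
        def dedupe(rows):
            return list(dict.fromkeys(rows))  # drop duplicates, keep first-seen order

        return LazyDataset(self._rows, self._plan + (("distinct", dedupe),), self.executed)

    def collect(self):
        # Action: run every recorded step, then return a plain Python list.
        rows = self._rows
        for name, step in self._plan:
            rows = step(rows)
            self.executed.append(name)
        return rows


ds = LazyDataset(["a", "b", "a", "c", "b"])
pending = ds.distinct()                      # nothing has run yet
ran_before_action = list(pending.executed)   # snapshot of the log before the action
result = pending.collect()                   # the recorded plan runs here
print(ran_before_action, result, pending.executed)
```

In this sketch the log is still empty after distinct() and only gains an entry once collect() is called, which mirrors the transformation/action split described above.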

Upvotes: 3

Vitaly Olegovitch

Reputation: 3547

distinct is a transformation. This means that it is not executed immediately, but only when an action is called.

collect is an action. Calling the collect method causes all previous transformations to be run.

Outside Spark, removing duplicates after collecting could increase the memory footprint of your program, because the duplicate elements are materialized as well. In Spark, calling distinct after collect could also make your entire program fail, for example if the full result with all its duplicates does not fit in the driver's memory.
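A rough plain-Python sketch of that difference follows. The partitions and the helper functions `collect` and `distinct` are made up for illustration and are not Spark calls; real Spark also shuffles data between executors to compute distinct, which this sketch ignores.

```python
# Hypothetical data: three partitions living on three executors.
partitions = [[1, 2, 2, 3], [3, 3, 4, 1], [2, 4, 4, 4]]


def collect(parts):
    """Simulate collect(): copy every record from every partition to the driver."""
    return [record for part in parts for record in part]


def distinct(parts):
    """Simulate distinct() on the cluster: one deduplicated partition (shuffle ignored)."""
    return [sorted({record for part in parts for record in part})]


# distinct, then collect: the driver receives only the unique records.
distinct_then_collect = collect(distinct(partitions))

# collect, then deduplicate on the driver: every duplicate is shipped first.
shipped_to_driver = collect(partitions)
collect_then_distinct = sorted(set(shipped_to_driver))

print(len(distinct_then_collect), len(shipped_to_driver))
```

Both orders end with the same unique values, but in the second one the driver has to hold every record, duplicates included, before it can drop any of them.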

You can find more explanations here: https://dzone.com/articles/getting-lazy-with-scala

Upvotes: 6
