What is the difference between df.select() and df.agg()?

Question

I have a dataframe from which I want to extract the maximum value, the minimum value, and the number of records.

The dataframe is:

scala> val df = spark.range(10000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]

For getting the required values I am using df.select(), like this:

scala> df.select(min("id"), max("id"), count("id")).show
+-------+-------+---------+
|min(id)|max(id)|count(id)|
+-------+-------+---------+
|      0|   9999|    10000|
+-------+-------+---------+

This gives me the correct results, but when I tried df.agg() it gave me the same answer:

scala> df.agg(min("id"), max("id"), count("id")).show
+-------+-------+---------+
|min(id)|max(id)|count(id)|
+-------+-------+---------+
|      0|   9999|    10000|
+-------+-------+---------+

So my question is: what is the difference between df.select() and df.agg() if they produce the same results, and which one should I use for better performance?

Ramesh Maharjan · Accepted Answer

select is used to select the required columns from a dataframe, whereas agg is used to aggregate groups of a dataframe by applying functions to each group.

In your case, min, max and count are computed over the whole dataset, so select and agg perform the same task, i.e. producing a new dataframe that holds the aggregated values.
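As a rough way to check this yourself, the sketch below builds the same aggregation three ways and compares the results. It relies on the documented fact that Dataset.agg is shorthand for groupBy().agg(...). The local SparkSession settings and the value names (viaSelect, viaAgg, viaGroupBy) are only assumptions for illustration. Comparing the explain() output is one way to judge whether there is any performance difference on your Spark version; I would expect the plans to match, but that is worth verifying rather than taking on trust.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, max, min}

// Assumed local session for illustration; in spark-shell getOrCreate()
// simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("select-vs-agg")
  .getOrCreate()

val df = spark.range(10000)

// The same whole-dataset aggregation, written three ways.
val viaSelect  = df.select(min("id"), max("id"), count("id"))
val viaAgg     = df.agg(min("id"), max("id"), count("id"))
val viaGroupBy = df.groupBy().agg(min("id"), max("id"), count("id"))

// Print the physical plans so they can be compared by eye.
viaSelect.explain()
viaAgg.explain()

// Each query returns a single row: (min, max, count).
val rows = Seq(viaSelect, viaAgg, viaGroupBy).map(_.head())
rows.foreach(println)
```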

The real difference becomes evident when you need to aggregate over groups of data. You can call agg on a grouped dataframe, but you cannot call select on one. select can only be applied to a whole dataset, i.e. the dataframe that a reference points to.
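A minimal sketch of the grouped case, assuming a made-up grouping key: the "bucket" column (id % 3) is hypothetical and exists only to give the data some groups. groupBy returns a grouped object that offers agg (plus shortcuts such as count, max, mean) but no select method.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, max, min}

// Assumed local session for illustration; reuses the shell's session if present.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouped-agg")
  .getOrCreate()

val df = spark.range(10000)

// Hypothetical grouping key: the remainder of id divided by 3.
val grouped = df.groupBy((col("id") % 3).as("bucket"))

// agg is defined on the grouped object and yields one row per bucket.
val perBucket = grouped
  .agg(min("id"), max("id"), count("id"))
  .orderBy("bucket")

perBucket.show()

// The following line would not compile, because the grouped object
// has no select method:
// grouped.select(min("id"))
```

Once agg has turned the grouped object back into a regular dataframe, select can be used on the result again, for example perBucket.select("bucket").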

If you check out the documentation for the grouped dataframe, you can see it defined as "A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy. The main method is the agg function, which has multiple variants. This class also contains convenience some first order statistics such as mean, sum for convenience."

I hope the answer is clear.
