Reputation: 6085
I have a dataframe from which I want to extract the maximum value, the minimum value and the number of records.
The dataframe is:
scala> val df = spark.range(10000)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
To get the required values I am using df.select(), like this:
scala> df.select(min("id"), max("id"), count("id")).show
+-------+-------+---------+
|min(id)|max(id)|count(id)|
+-------+-------+---------+
|      0|   9999|    10000|
+-------+-------+---------+
This gives me the correct results, but when I tried df.agg() it gave me the same answer:
scala> df.agg(min("id"), max("id"), count("id")).show
+-------+-------+---------+
|min(id)|max(id)|count(id)|
+-------+-------+---------+
|      0|   9999|    10000|
+-------+-------+---------+
So, my question is: what is the difference between df.select() and df.agg() if they provide the same results, and which one should I use for better performance?
Upvotes: 2
Views: 3356
Reputation: 41957
select is used to select the required columns from a dataframe, whereas agg is used to aggregate groups of a dataframe by applying some functions to each group.
In your case, min, max and count are performed on the whole dataset, so select and agg do the same job: each transforms the dataframe into a new, single-row dataframe holding the aggregated values.
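As a rough sketch of that equivalence (assuming a local SparkSession; the object name `SelectVsAgg`, the helper `summaries` and the app name are made up for illustration), you can run both calls on the same data and compare the rows. The Dataset API documents `ds.agg(...)` as shorthand for `ds.groupBy().agg(...)`, i.e. an aggregation over the entire dataset without groups.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{count, max, min}

object SelectVsAgg {
  // Hypothetical helper: returns the single result row of each variant.
  def summaries(spark: SparkSession): (Row, Row) = {
    val df = spark.range(10000)
    val viaSelect = df.select(min("id"), max("id"), count("id"))
    val viaAgg    = df.agg(min("id"), max("id"), count("id"))

    // Print the physical plans so they can be compared by eye.
    viaSelect.explain()
    viaAgg.explain()

    (viaSelect.head(), viaAgg.head())
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is good enough for this experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-vs-agg")
      .getOrCreate()

    val (s, a) = summaries(spark)
    println(s"select: $s, agg: $a, equal: ${s == a}")
    spark.stop()
  }
}
```

Comparing the two `explain()` outputs is a simple way to check on your own Spark version whether there is any performance difference between the two forms.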
The real difference becomes evident when you have to perform aggregations on groups of data. You can call agg on a grouped dataframe, but you cannot call select on one. select can only be applied to the whole dataset that the variable refers to.
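A minimal sketch of the grouped case, assuming the same `spark.range(10000)` data as in the question. The `bucket` column (`id % 3`), the object name `GroupedAgg` and the helper `perBucket` are invented here purely to have something to group on.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, max, min}

object GroupedAgg {
  // Hypothetical helper: min, max and count of id within each id % 3 bucket.
  def perBucket(spark: SparkSession): DataFrame = {
    val df = spark.range(10000)

    // groupBy returns a RelationalGroupedDataset, not a DataFrame.
    val grouped = df.groupBy((col("id") % 3).as("bucket"))

    // grouped.select(min("id")) would not compile:
    // RelationalGroupedDataset has no select method.

    // agg turns the grouped dataset back into a DataFrame, one row per group.
    grouped.agg(min("id"), max("id"), count("id"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is good enough for this experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouped-agg")
      .getOrCreate()

    perBucket(spark).orderBy("bucket").show()
    spark.stop()
  }
}
```

Because the result of agg is an ordinary DataFrame again, select can be applied to it afterwards, for example to keep only some of the aggregated columns.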
If you check out the documentation for the grouped dataframe, you can see it defined as "A set of methods for aggregations on a DataFrame, created by DataFrame.groupBy. The main method is the agg function, which has multiple variants. This class also contains some first-order statistics such as mean, sum for convenience."
I hope the answer is clear
Upvotes: 5