Reputation: 2487
What is the fundamental difference between using .groupByKey and .groupBy when I pass a column name of a DataFrame as the parameter?
Which one is more time-efficient, and what exactly does each one do? Could someone please explain in detail? I went through some examples, but they were confusing.
Upvotes: 0
Views: 4160
Reputation:
There is no groupByKey method that takes a Column as an argument. There are only methods that take functions, either:
def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]
or
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
Compare these with groupBy, which takes Columns:
def groupBy(cols: Column*): RelationalGroupedDataset
or Strings:
def groupBy(col1: String, cols: String*): RelationalGroupedDataset
The difference should be clear: the first two return KeyValueGroupedDataset (intended for processing with the "functional", strongly typed API, such as mapGroups or reduceGroups), while the latter two return RelationalGroupedDataset (intended for processing with the SQL-like API).
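To make the distinction concrete, here is a minimal sketch that groups the same data both ways. The Sale case class, its shop and amount fields, the sample rows, and the local SparkSession settings are all assumptions made up for illustration; only groupBy, groupByKey, agg, and mapGroups come from the API discussed above.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record type, used only for this illustration.
case class Sale(shop: String, amount: Long)

object GroupByVsGroupByKey {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; the settings are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupBy-vs-groupByKey")
      .getOrCreate()
    import spark.implicits._

    val sales: Dataset[Sale] =
      Seq(Sale("a", 1L), Sale("a", 2L), Sale("b", 5L)).toDS()

    // groupBy takes Columns (or column names) and returns a
    // RelationalGroupedDataset, aggregated with SQL-like expressions.
    val relational = sales.groupBy($"shop").agg(sum($"amount").as("total"))

    // groupByKey takes a function from the element type to a key and returns
    // a KeyValueGroupedDataset, processed with typed operations like mapGroups.
    val typed = sales
      .groupByKey(_.shop)
      .mapGroups((shop: String, rows: Iterator[Sale]) =>
        (shop, rows.map(_.amount).sum))

    relational.show()
    typed.show()
    spark.stop()
  }
}
```

Note that the groupBy version refers to the column by name, while the groupByKey version works on Sale objects through an ordinary Scala function, which is why it cannot accept a Column.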
In general see:
Upvotes: 3