Difference between GroupByKey($"col") and GroupBy($"col") in spark scala

Question

What is the fundamental difference between .groupByKey and .groupBy when I pass a column name of a DataFrame as the parameter?

Which one is more time-efficient, and what exactly does each one do? Could someone please explain in detail? I went through some examples, but they were confusing.

user10546212 · Accepted Answer

There is no groupByKey method that takes a Column as an argument. There are only methods that take functions, either:

def groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]

or

def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]

Compare these with groupBy, which takes Columns:

def groupBy(cols: Column*): RelationalGroupedDataset

or Strings:

def groupBy(col1: String, cols: String*): RelationalGroupedDataset

The difference should be obvious: the first two return KeyValueGroupedDataset (intended for processing with the "functional", strongly typed API, such as mapGroups or reduceGroups), while the latter two return RelationalGroupedDataset (intended for processing with the SQL-like API).
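
To make the contrast between the signatures above concrete, here is a minimal sketch that computes the same per-key totals both ways. The Sale case class, the column names, the sample rows, and the GroupingSketch object are all made up for illustration, and the sketch assumes a local SparkSession with Spark SQL on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.sum

// Hypothetical record type, used only for this illustration.
case class Sale(col: String, amount: Long)

object GroupingSketch {

  // groupBy takes Columns (or column names) and returns a
  // RelationalGroupedDataset, which is aggregated with SQL-like functions.
  def relationalTotals(spark: SparkSession, ds: Dataset[Sale]): DataFrame = {
    import spark.implicits._
    ds.groupBy($"col").agg(sum($"amount").as("total"))
  }

  // groupByKey takes a function from the row type to a key and returns a
  // KeyValueGroupedDataset, which is processed with typed operations such
  // as mapGroups. Passing a Column here (ds.groupByKey($"col")) would not
  // compile, because no overload accepts a Column.
  def typedTotals(spark: SparkSession, ds: Dataset[Sale]): Dataset[(String, Long)] = {
    import spark.implicits._
    ds.groupByKey(_.col).mapGroups { (key, rows) =>
      (key, rows.map(_.amount).sum)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("groupBy-vs-groupByKey")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Sale("a", 1L), Sale("a", 2L), Sale("b", 5L)).toDS()

    relationalTotals(spark, ds).show()
    typedTotals(spark, ds).show()

    spark.stop()
  }
}
```

Both calls produce one total per key. The relational version yields an untyped DataFrame, while the typed version yields a Dataset[(String, Long)] whose element type is checked at compile time.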
