Sam

Reputation: 841

Spark DataFrame orderBy and DataFrameWriter sortBy, is there a difference?

Is there a difference in the output between sorting before or after the .write command on a DataFrame?

val people: Dataset[Person]

people
        .orderBy("name")
        .write
        .mode(SaveMode.Append)
        .format("parquet")
        .saveAsTable("test_segments") 

and

val people: Dataset[Person]

people
        .write
        .sortBy("name")
        .mode(SaveMode.Append)
        .format("parquet")
        .saveAsTable("test_segments") 

Upvotes: 3

Views: 840

Answers (1)

Michael Heil

Reputation: 18515

The difference between the two is explained in the comments in the Spark source code:

  • orderBy: a Dataset/DataFrame operation. Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.
  • sortBy: a DataFrameWriter operation. Sorts the output in each bucket by the given columns.

The sortBy method only works when you also define buckets with bucketBy. Otherwise the write fails with an exception:

if (sortColumnNames.isDefined && numBuckets.isEmpty) {
  throw new AnalysisException("sortBy must be used together with bucketBy")
}
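For comparison, here is a minimal sketch of the second snippet from the question with a bucketBy call added, so that the check above passes. The Person fields, the sample rows, the local master and the bucket count of 4 are assumptions made up for illustration; only the table name and the write options come from the question.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object BucketedSortedWrite {
  // Hypothetical stand-in for the asker's Person type.
  final case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bucketed-sorted-write")
      .master("local[*]") // assumption: local run for demonstration
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val people = Seq(Person("Carol", 41), Person("Alice", 29), Person("Bob", 35)).toDS()

    people.write
      .bucketBy(4, "name") // 4 buckets is an arbitrary choice
      .sortBy("name")      // rows are sorted by name within each bucket
      .mode(SaveMode.Append)
      .format("parquet")
      .saveAsTable("test_segments")

    spark.stop()
  }
}
```

Note that this sorts rows inside each bucket only, whereas orderBy sorts the whole Dataset before it is written. When appending to a table that already exists, the bucketing of the write is expected to match the bucketing of that table.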

The columns defined in sortBy are used in the BucketSpec as sortColumnNames, as shown below:

Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns that are used to generate the bucket id.
sortColumnNames – the names of the columns that are used to sort data in each bucket.

case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])
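
To make the relationship between bucketBy, sortBy and BucketSpec concrete, here is a simplified, Spark-free model of how the writer options could be collected and validated. WriterOptions is a hypothetical class invented for this sketch, BucketSpec is redefined locally rather than imported from Spark, and IllegalArgumentException stands in for Spark's AnalysisException.

```scala
// Local copy of the case class shown above; not imported from Spark.
final case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])

// Hypothetical holder for the options a writer has collected so far.
final case class WriterOptions(
    numBuckets: Option[Int] = None,
    bucketColumnNames: Option[Seq[String]] = None,
    sortColumnNames: Option[Seq[String]] = None) {

  // Records the bucket count and the bucketing columns.
  def bucketBy(n: Int, col: String, cols: String*): WriterOptions =
    copy(numBuckets = Some(n), bucketColumnNames = Some(col +: cols))

  // Records the per-bucket sort columns.
  def sortBy(col: String, cols: String*): WriterOptions =
    copy(sortColumnNames = Some(col +: cols))

  // Builds the spec, rejecting sortBy without bucketBy as in the check quoted earlier.
  def bucketSpec: Option[BucketSpec] = {
    if (sortColumnNames.isDefined && numBuckets.isEmpty) {
      throw new IllegalArgumentException("sortBy must be used together with bucketBy")
    }
    numBuckets.map { n =>
      BucketSpec(n, bucketColumnNames.getOrElse(Nil), sortColumnNames.getOrElse(Nil))
    }
  }
}
```

In this model, WriterOptions().bucketBy(4, "name").sortBy("name").bucketSpec yields a spec whose sortColumnNames contains "name", while calling sortBy alone makes bucketSpec throw.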

Upvotes: 4
