Reputation: 841
Is there a difference in the output between sorting before or after the .write
command on a DataFrame?
val people: Dataset[Person]
people
  .orderBy("name")
  .write
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")
and
val people: Dataset[Person]
people
  .write
  .sortBy("name")
  .mode(SaveMode.Append)
  .format("parquet")
  .saveAsTable("test_segments")
Upvotes: 3
Views: 840
Reputation: 18515
The difference between the two is explained in the comments within the Spark source code:
The sortBy method only works when you also define buckets (bucketBy). Otherwise you will get an exception:
if (sortColumnNames.isDefined && numBuckets.isEmpty) {
  throw new AnalysisException("sortBy must be used together with bucketBy")
}
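For reference, here is a minimal sketch of a write that would pass this check, since it pairs sortBy with bucketBy. The bucket count of 4, the `Person` fields, the local SparkSession settings, and the `writeBucketed` helper name are assumptions made for illustration; they do not come from the question.

```scala
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}

// Hypothetical row type; the question does not show the fields of Person.
final case class Person(name: String, age: Int)

object BucketedWriteSketch {
  // bucketBy is present, so sortBy passes the check quoted above.
  def writeBucketed(people: Dataset[Person], table: String, numBuckets: Int): Unit =
    people.write
      .bucketBy(numBuckets, "name") // bucket column: name
      .sortBy("name")               // sort rows by name inside each bucket
      .mode(SaveMode.Append)
      .format("parquet")
      .saveAsTable(table)

  def main(args: Array[String]): Unit = {
    // Assumed local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .appName("bucketed-write-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Dana", 41), Person("Ali", 29)).toDS()
    writeBucketed(people, "test_segments", numBuckets = 4)
    spark.stop()
  }
}
```

Dropping the bucketBy line from this sketch should reproduce the AnalysisException from the second snippet in the question.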
The columns defined in sortBy are used in the BucketSpec as sortColumnNames, as shown below:
Params:
numBuckets – number of buckets.
bucketColumnNames – the names of the columns that used to generate the bucket id.
sortColumnNames – the names of the columns that used to sort data in each bucket.
case class BucketSpec(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])
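To make "sort data in each bucket" concrete, here is a small plain-Scala model that contrasts it with a global sort such as orderBy. It is only an illustration: `bucketId` is a hypothetical stand-in based on `String.hashCode`, whereas Spark computes bucket ids with its own hash function, so real bucket contents will differ. The `Person` fields and sample rows are also assumptions.

```scala
// Illustrative model only; it does not use Spark.
final case class Person(name: String, age: Int)

object BucketSortSketch {
  // Hypothetical stand-in for Spark's bucket id computation.
  def bucketId(name: String, numBuckets: Int): Int =
    Math.floorMod(name.hashCode, numBuckets)

  // Group rows by bucket, then sort by name inside each bucket only.
  def sortWithinBuckets(rows: Seq[Person], numBuckets: Int): Map[Int, Seq[Person]] =
    rows
      .groupBy(p => bucketId(p.name, numBuckets))
      .map { case (id, ps) => id -> ps.sortBy(_.name) }

  def main(args: Array[String]): Unit = {
    val people = Seq(Person("Dana", 41), Person("Ali", 29), Person("Chen", 35), Person("Bea", 52))

    // Global order: every row is sorted relative to every other row,
    // which is what orderBy("name") does to the DataFrame before the write.
    println(people.sortBy(_.name).map(_.name))

    // Per-bucket order: rows are sorted only relative to rows in the same bucket,
    // which is the role of sortColumnNames.
    sortWithinBuckets(people, 2).toSeq.sortBy(_._1).foreach { case (id, ps) =>
      println(s"bucket $id: ${ps.map(_.name)}")
    }
  }
}
```

So the two calls are not interchangeable: orderBy sorts the whole DataFrame before it reaches the writer, while sortBy only describes the ordering inside each bucket and therefore needs bucketBy.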
Upvotes: 4