Reputation: 540
I'm having a pretty troubling problem with the LAST aggregate in SparkSQL in Spark 2.3.1. It seems to give me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- in 500,000 (logical SQL, not Spark) partitions, something like 50MM records. Smaller batches are worse -- the number of errors per batch seems pretty consistent, although I don't think I tried anything smaller than 100,000 logical SQL partitions.
I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical sql partition key and a compound (string, string) sort key. I tried converting the four-character numeric values into integers, then I combined them into a single integer. Neither of those moves resolved the problem. Even with a single integer sort key, I was getting a few bad values.
Typically, there are fewer than a hundred records in each partition, and a handful of non-NULL values for any field. It never seems to get the second to last value; it's always at least third to last.
I did try to replace the simple aggregate with a windowed aggregate with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the compound integer key had given me only two, but I didn't do enough runs to really compare the approaches and of course I need zero.
Why do I not seem to be able to rely on LAST()? Here's a test which just illustrates the unwindowed version of the LAST function, although my partitioning and sorting fields are each two fields.
import org.apache.spark.sql.functions.{expr}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}
import collection.JavaConverters._
class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {
implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
import spark.implicits._
// TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
"LAST" must "work with GROUP BY" in {
val lastSchema = StructType(Seq(
StructField("Pfield", IntegerType) // partition field
, StructField("Ofield", IntegerType) // order field
, StructField("Vfield", StringType) // value field
))
val last:DataFrame = spark.createDataFrame(List[Row](
Row(0, 1, "Pencil")
, Row(5, 1, "Aardvark")
, Row(10, 1, "Monastery")
, Row(10, 2, "Remediation")
, Row(15, 1, "Parcifal")
, Row(20, 1, "Montenegro")
, Row(20, 2, "Susquehana")
, Row(20, 3, "Perfidy")
, Row(20, 4, "Prosody")
).asJava
, lastSchema
).repartition(expr("MOD(Pfield, 4)"))
last.createOrReplaceTempView("last_group_test")
// apply the unwindowed last
val unwindowed:DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
unwindowed.show(5)
// apply a windowed last
val windowed:DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
windowed.show(5)
// include the partitioning function in the window
val excessivelyWindowed:DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
excessivelyWindowed.show(5)
assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
assert(windowed.count() == 5)
assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0)==="Prosody")
}
}
So, all three datasets are the same, which is nice. But, if I apply this logic to my actual needs -- which has sixty-odd columns, almost all of which are LAST values -- I'll get an error, it looks like about 4 times in a batch of 500,000 groups. If I run the dataset 30 times, I'll get 30 different sets of bad records.
Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't see if, but "aggregates simply don't work sometimes" can't be something they released with, right?
Upvotes: 1
Views: 440
Reputation: 540
I was able to resolve the issue by applying with windowed aggregate to a dataset with the same sorting, sorted in a subquery.
SELECT LAST(VAL) FROM (SELEcT * FROM TBL ORDER BY SRT) SRC GROUP BY PRT
was not sufficient, nor was
SELECT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM TBL
I had to do both
SELECT DISTINCT LAST(VAL) OVER (PARTITION BY PRT ORDER BY SRT ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) FROM (SELEcT * FROM TBL ORDER BY SRT) SRC
These datasets had been extracted from an Oracle 12.2 instance over JDBC. I also added SRT to the order by clause there, which had just had ORDER BY PRT.
Further -- and I think this may have been most significant -- I used the cacheTable API on the spark catalog object after extracting the data. I had been doing
.repartition
.cache
.count
in order to load all the records with a relatively small number of data connections, but I suspect it was not enough to get all the data sparkside before the aggregations took place.
Upvotes: 0