LAST (not NULL) gives incorrect answers on a handful for records (Apache Spark SparkSQL)

Question

I'm having a pretty troubling problem with the LAST aggregate in SparkSQL in Spark 2.3.1. It seems to give me around 4 bad results -- that is, values that are not LAST by the specified partitioning and order -- in 500,000 (logical SQL, not Spark) partitions, something like 50MM records. Smaller batches are worse -- the number of errors per batch seems pretty consistent, although I don't think I tried anything smaller than 100,000 logical SQL partitions.

I have roughly 66 FIRST or LAST aggregates, a compound (date, integer) logical sql partition key and a compound (string, string) sort key. I tried converting the four-character numeric values into integers, then I combined them into a single integer. Neither of those moves resolved the problem. Even with a single integer sort key, I was getting a few bad values.

Typically, there are fewer than a hundred records in each partition, and a handful of non-NULL values for any field. It never seems to get the second to last value; it's always at least third to last.

I did try to replace the simple aggregate with a windowed aggregate with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The one run I did of that gave me six bad records -- the compound integer key had given me only two, but I didn't do enough runs to really compare the approaches and of course I need zero.

Why do I not seem to be able to rely on LAST()? Here's a test which just illustrates the unwindowed version of the LAST function, although my partitioning and sorting fields are each two fields.

import org.apache.spark.sql.functions.{expr}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.scalamock.scalatest.MockFactory
import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers}

import collection.JavaConverters._

class LastTest extends FlatSpec with Matchers with MockFactory with BeforeAndAfterAll {

  implicit val spark: SparkSession = SparkSession.builder().appName("Last Test").master("local[2]").getOrCreate()
  import spark.implicits._

  // TRN_DATE, TRN_NUMBER, TRN_TIMESTAMP, DETAILS, DATE_TIME, QUE_LINE_ID, OPR_INITIALS, ENTRY_TYPE, HIST_NO, SUB_HIST_NO, MSG_INFO
  "LAST" must "work with GROUP BY" in {
    val lastSchema = StructType(Seq(
      StructField("Pfield", IntegerType) // partition field
      , StructField("Ofield", IntegerType) // order field
      , StructField("Vfield", StringType) // value field
    ))
    val last:DataFrame = spark.createDataFrame(List[Row](
      Row(0, 1, "Pencil")
      , Row(5, 1, "Aardvark")
      , Row(10, 1, "Monastery")
      , Row(10, 2, "Remediation")
      , Row(15, 1, "Parcifal")
      , Row(20, 1, "Montenegro")
      , Row(20, 2, "Susquehana")
      , Row(20, 3, "Perfidy")
      , Row(20, 4, "Prosody")
    ).asJava
      , lastSchema
    ).repartition(expr("MOD(Pfield, 4)"))
    last.createOrReplaceTempView("last_group_test")
    // apply the unwindowed last
    val unwindowed:DataFrame = spark.sql("SELECT Pfield, LAST(Vfield) AS Vlast FROM (SELECT * FROM last_group_test ORDER BY Pfield, Ofield) GROUP BY Pfield ORDER BY Pfield")
    unwindowed.show(5)
    // apply a windowed last
    val windowed:DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    windowed.show(5)
    // include the partitioning function in the window
    val excessivelyWindowed:DataFrame = spark.sql("SELECT DISTINCT Pfield, LAST(Vfield) OVER (PARTITION BY MOD(Pfield, 4), Pfield ORDER BY Ofield ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS Vlast FROM last_group_test ORDER BY Pfield")
    excessivelyWindowed.show(5)
    assert(unwindowed.collect() === windowed.collect() && windowed.collect() === excessivelyWindowed.collect())
    assert(windowed.count() == 5)
    assert(windowed.filter("Pfield=20").select($"Vlast").collect()(0)(0)==="Prosody")
  }
}

So, all three datasets are the same, which is nice. But, if I apply this logic to my actual needs -- which has sixty-odd columns, almost all of which are LAST values -- I'll get an error, it looks like about 4 times in a batch of 500,000 groups. If I run the dataset 30 times, I'll get 30 different sets of bad records.

Am I doing something wrong, or is this a defect? Is it a known defect? Is it fixed in 2.4? I didn't see if, but "aggregates simply don't work sometimes" can't be something they released with, right?

LAST (not NULL) gives incorrect answers on a handful for records (Apache Spark SparkSQL)

Answers (1)

Related Questions