hwg2529


How to use a Spark SQL UDAF to implement window counting with a condition?

I have a table with columns timestamp, id, and condition, and I want to count the rows for each id per interval (such as 10 seconds).

If condition is true, the count is incremented; otherwise the previous value is carried forward.

The UDAF code looks like this:

import java.util.Arrays;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MyCount extends UserDefinedAggregateFunction {

    @Override
    public StructType inputSchema() {
        return DataTypes.createStructType(
                Arrays.asList(
                        DataTypes.createStructField("condition", DataTypes.BooleanType, true),
                        DataTypes.createStructField("timestamp", DataTypes.LongType, true),
                        DataTypes.createStructField("interval", DataTypes.IntegerType, true)
                )
        );
    }

    @Override
    public StructType bufferSchema() {
        return DataTypes.createStructType(
                Arrays.asList(
                        DataTypes.createStructField("timestamp", DataTypes.LongType, true),
                        DataTypes.createStructField("count", DataTypes.LongType, true)
                )
        );
    }

    @Override
    public DataType dataType() {
        return DataTypes.LongType;
    }

    @Override
    public boolean deterministic() {
        return true;
    }

    @Override
    public void initialize(MutableAggregationBuffer mutableAggregationBuffer) {
        mutableAggregationBuffer.update(0, 0L);
        mutableAggregationBuffer.update(1, 0L);
    }

    @Override
    public void update(MutableAggregationBuffer mutableAggregationBuffer, Row row) {
        long timestamp = mutableAggregationBuffer.getLong(0);
        long count = mutableAggregationBuffer.getLong(1);
        long event_time = row.getLong(1);
        int interval = row.getInt(2);
        if (event_time > timestamp + interval) {
            timestamp = event_time - event_time % interval;
            count = 0;
        }
        if (row.getBoolean(0)) {
            count++;
        }
        mutableAggregationBuffer.update(0, timestamp);
        mutableAggregationBuffer.update(1, count);
    }
    
    @Override
    public void merge(MutableAggregationBuffer mutableAggregationBuffer, Row row) {

    }

    @Override
    public Object evaluate(Row row) {
        return row.getLong(1);
    }
}

Then I submit a SQL query like:

select timestamp, id, MyCount(true, timestamp, 10) over(PARTITION BY id ORDER BY timestamp) as count from xxx.xxx

The result is:

timestamp    id     count
1642760594    0        1
1642760596    0        2
1642760599    0        3
1642760610    0        2 --duplicate
1642760610    0        2
1642760613    0        3
1642760594    1        1
1642760597    1        2
1642760600    1        1
1642760603    1        2
1642760606    1        4 --duplicate
1642760606    1        4
1642760608    1        5

When the timestamp is repeated, I get 1, 2, 4, 4, 5 instead of 1, 2, 3, 4, 5. How can I fix it?
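The duplicates are consistent with the default window frame: with ORDER BY and no explicit frame specification, the frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows with equal timestamps are peers. All peers are folded into the aggregation buffer first, and then every peer is emitted with the same aggregate value. A plain-Java sketch of that evaluation order (class and method names are made up, not Spark code) reproduces the observed output for id 0:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a RANGE ... UNBOUNDED PRECEDING AND CURRENT ROW frame is
// evaluated: all peer rows with an equal ORDER BY value are added to the
// buffer first, then each peer is emitted with the same result.
public class DefaultRangeFrame {
    static List<Long> counts(long[] ts, int interval) {
        List<Long> out = new ArrayList<>();
        long windowStart = 0L, count = 0L;   // the UDAF buffer
        int i = 0;
        while (i < ts.length) {
            int j = i;
            while (j < ts.length && ts[j] == ts[i]) j++;   // collect all peers
            for (int k = i; k < j; k++) {                  // update() per peer
                if (ts[k] > windowStart + interval) {
                    windowStart = ts[k] - ts[k] % interval;
                    count = 0;
                }
                count++;                                   // condition is always true here
            }
            for (int k = i; k < j; k++) out.add(count);    // same value for every peer
            i = j;
        }
        return out;
    }
}
```

Running this over the id 0 timestamps yields 1, 2, 3, 2, 2, 3, i.e. the duplicated 2 from the result table above. row_number() is unaffected because it is evaluated with a row-based running frame, where each row gets its own value.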

Another question: when is the merge method of a UDAF executed? I left it empty, yet everything runs normally. I added a log statement inside the method but never saw its output. Is it really necessary?
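For context: Spark calls merge() only when it combines partial aggregation buffers, for example in a grouped aggregate computed across partitions; window aggregation builds the buffer sequentially within each partition, so merge() is never invoked there and an empty implementation goes unnoticed. If the same UDAF were used in a GROUP BY, merge() would need real logic. A hypothetical plain-Java sketch for this (windowStart, count) buffer, assuming the second buffer covers later rows than the first (the class name and the ordering assumption are mine, not Spark's):

```java
// Illustrative sketch of what merge() would have to do for a
// (windowStart, count) buffer pair; a real UDAF would write the
// result back into the MutableAggregationBuffer instead.
public class BufferMerge {
    static long[] merge(long[] b1, long[] b2) {
        if (b1[0] == b2[0]) {
            return new long[] { b1[0], b1[1] + b2[1] };  // same window: add counts
        }
        return b2[0] > b1[0] ? b2 : b1;  // different windows: keep the later one
    }
}
```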


There is a similar question: Apache Spark SQL UDAF over window showing odd behaviour with duplicate input


However, row_number() does not have this problem. Since row_number() is a Hive UDAF, I tried writing a Hive UDAF of my own, but I ran into the same problem... Also, why does the terminate() of the Hive UDAF row_number() return an ArrayList? I created my own UDAF row_number2() by copying its code, and I also got a list as the return value.

Upvotes: 1

Views: 215

Answers (1)

hwg2529


Finally, I solved it with Spark's AggregateWindowFunction:

import org.apache.spark.internal.Logging
import org.apache.spark.sql.catalyst.expressions.{Add, AggregateWindowFunction, AttributeReference, Expression, If, Literal}
import org.apache.spark.sql.types.{DataType, LongType}

case class Count(condition: Expression) extends AggregateWindowFunction with Logging {

  override def prettyName: String = "myCount"

  override def dataType: DataType = LongType

  override def children: Seq[Expression] = Seq(condition)

  private val zero = Literal(0L)
  private val one = Literal(1L)

  private val count = AttributeReference("count", LongType, nullable = false)()

  private val increaseCount = If(condition, Add(count, one), count)

  override val initialValues: Seq[Expression] = zero :: Nil
  override val updateExpressions: Seq[Expression] = increaseCount :: Nil
  override val evaluateExpression: Expression = count

  override val aggBufferAttributes: Seq[AttributeReference] = count :: Nil
}

Then use spark_session.sessionState.functionRegistry.registerFunction to register it, and query it with:

"select myCount(true) over(partition by window(timestamp, '10 seconds'), id order by timestamp) as count from xxx"
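This works because AggregateWindowFunction fixes its frame to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so tied timestamps no longer share a frame, and partitioning by window(timestamp, '10 seconds') restarts the count for each interval. A plain-Java sketch of the resulting per-partition evaluation (illustrative names, not Spark code; rows are assumed already sorted within one id):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the fixed behavior: partition rows by their 10-second window,
// then run a ROWS-frame running count, so rows with tied timestamps each
// get their own running value.
public class RowFrameCount {
    static List<Long> counts(long[] ts, boolean[] cond, int interval) {
        List<Long> out = new ArrayList<>();
        long currentWindow = Long.MIN_VALUE;
        long count = 0;
        for (int i = 0; i < ts.length; i++) {
            long w = ts[i] - ts[i] % interval;   // window(timestamp, '10 seconds')
            if (w != currentWindow) {            // new window partition: reset
                currentWindow = w;
                count = 0;
            }
            if (cond[i]) count++;                // If(condition, Add(count, one), count)
            out.add(count);                      // each row emits its own running value
        }
        return out;
    }
}
```

On the id 1 timestamps from the question this produces 1, 2, 1, 2, 3, 4, 5: the last window now counts 1 through 5 instead of 1, 2, 4, 4, 5.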

Upvotes: 1
