Justin van Dongen

Reputation: 11

Pyspark data aggregation with Window and sliding interval on index

I am running into an issue where I want to apply a window with a sliding interval to my CSV and, for each window, perform an aggregation to find the most common category. However, I do not have a timestamp column, and I want the window to slide over the index column instead. Can anyone point me in the right direction on how to use windows with sliding intervals on the index?

In short, I want to create windows with sliding intervals over the index column.

Currently I have something like this:

from pyspark.sql.types import StructType

schema = StructType().add("index", "string").add("Category", "integer")

dataframe = spark \
    .readStream \
    .option("sep", ",") \
    .schema(schema) \
    .csv("./tmp/input")

# TODO perform Window + sliding interval on dataframe, then perform aggregation per window
aggr = dataframe.groupBy("Category").count().orderBy("count", ascending=False).limit(3)

query = aggr \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

Upvotes: 0

Views: 954

Answers (1)

Aman Lakhani

Reputation: 51

To aggregate data on a per-window basis, you can use the window function from the pyspark.sql.functions package.
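The window function (along with the other helpers used below) lives in pyspark.sql.functions, so the snippets that follow assume imports along these lines:

from pyspark.sql.functions import current_timestamp, window, col, sum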

For a time interval, you need to add a timestamp column to your dataframe.

newDf = csvFile.withColumn("TimeStamp", current_timestamp())

This code adds the current time to the dataframe as the data is read from the CSV.

trimmedDf2 = newDf \
    .groupBy(window(col("TimeStamp"), "5 seconds")) \
    .agg(sum("value")) \
    .select("window.start", "window.end", "sum(value)")

display(trimmedDf2)

The above code sums up the value column and groups the results into 5-second timestamp windows.
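Applied to the streaming dataframe from the question, a minimal sketch of the same idea might look like this (the 10-second window length and 5-second slide are placeholder values; pick whatever interval suits your data):

from pyspark.sql.functions import current_timestamp, window, col

# Tag each row with the processing time as it is read from the CSV source
stamped = dataframe.withColumn("TimeStamp", current_timestamp())

# Count each Category per sliding window: 10-second windows, sliding every 5 seconds
windowedCounts = stamped \
    .groupBy(window(col("TimeStamp"), "10 seconds", "5 seconds"), col("Category")) \
    .count()

query = windowedCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

Within each window, the row with the highest count then gives the most common category.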


Weekly Aggregation using Windows Function in Spark

You can also use the link above for reference.

Upvotes: 1
