Word Count with timestamp in Python

Question

This example is extracted from Structured Streaming Programming Guide of Spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
        .builder \
        .appName("StructuredNetworkWordCount") \
        .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
   lines = spark \
     .readStream \
     .format("socket") \
     .option("host", "localhost") \
     .option("port", 9999) \
     .load()

# Split the lines into words
  words = lines.select(
    explode(
       split(lines.value, " ")
       ).alias("word"),
       lines.timestamp.alias('time')
)

# Generate running word count
 wordCounts = words.groupBy("word").count() #line to modify

# Start running the query that prints the running counts to the console
query = wordCounts \
      .writeStream \
      .outputMode("complete") \
      .format("console") \
      .start()

query.awaitTermination()

I need to create a table with every word and its input time. The output table should be like this:

+-------+--------------------+
|word   |              time  |
+-------+--------------------+
|   car |2021-12-16  12:21:..|
+-------+--------------------+

How can I do it? I think the line marked with "#line to modify" is only the line to modify.

Word Count with timestamp in Python

Answers (1)

Related Questions