Prithvi Singh

Reputation: 97

Way to use previous data with current data in pyspark with kafka stream

I am sending dict objects from my producer and using PySpark to create a new object. But the kind of object I want to form also requires the key-value pair of the previous data. I tried window batching and reduceByKey, but neither of them seems to work.

Suppose my producer object is a "url_id": "url" pair, for example {"url_id": "google.com"}, and in Spark I want to form an object like: {"data": {"url_id": "url", "url_id_of_previous_url": "url", ... and so on}}

My Spark code is:

    import json

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    # appName, batchTime and topic are defined elsewhere
    conf = SparkConf().setAppName(appName).setMaster("local[*]")
    sc = SparkContext(conf=conf)

    stream_context = StreamingContext(sparkContext=sc, batchDuration=batchTime)
    kafka_stream = KafkaUtils.createDirectStream(ssc=stream_context, topics=[topic],
                                                 kafkaParams={"metadata.broker.list": "localhost:9092",
                                                              "auto.offset.reset": "smallest"})
    lines = kafka_stream.map(lambda x: json.loads(x[1]))

I am stuck after this. Can you tell me whether forming such an object is possible with Spark? And if it is, what should I use?

Upvotes: 0

Views: 382

Answers (1)

suresiva

Reputation: 3173

As far as I know, you can solve this in two ways.

The first approach is the simpler one: let the message-producing application itself send the pair of messages (current & previous) by keeping an internal cache of the last message it produced.
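
A minimal sketch of that producer-side caching, assuming the kafka-python client; the send_with_previous helper, the broker address and the current/previous field names are just placeholders for illustration:

    import json
    from kafka import KafkaProducer

    # serialize each dict to JSON before writing it to the topic
    producer = KafkaProducer(bootstrap_servers='localhost:9092',
                             value_serializer=lambda d: json.dumps(d).encode('utf-8'))

    previous_msg = None  # internal cache of the last message produced

    def send_with_previous(topic, current_msg):
        global previous_msg
        # ship the current message together with the one sent just before it
        producer.send(topic, {'current': current_msg, 'previous': previous_msg})
        previous_msg = current_msg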

The second approach is to use Spark stateful streaming to maintain the last message's values in the Spark state context. Since you are using PySpark, the only option I know of is updateStateByKey with checkpointing enabled.

The typical flow with PySpark Streaming will be as follows:

  • Define the initial state value and the update function.
  • Maintain a common key to match the current and previous messages; I used pair_msgs in this example.

    # RDD holding the initial state (key, value) pair
    initialStateRDD = sc.parallelize([(u'pair_msgs', '{"url_id": "none"}')])

    def updateFunc(new_url_msgs, last_url_msg):
        # new_url_msgs is the list of messages received for the key in this batch;
        # last_url_msg is the JSON string kept in state from the previous batch
        if not new_url_msgs:
            return last_url_msg
        new_url_dict = json.loads(new_url_msgs[0])
        new_url_dict['url_id_previous'] = json.loads(last_url_msg)['url_id']
        return json.dumps(new_url_dict)
    
  • Map the input messages to the common key, pair_msgs in this example.

  • Invoke the updateStateByKey transformation with the above update function.

    # keep only the message payload from each (key, value) tuple coming from Kafka
    feeds = kafka_stream.map(lambda x: x[1])

    # pair every message with the common key so updateStateByKey tracks a single state
    pair_feed = feeds.map(lambda feed_str: ('pair_msgs', feed_str)) \
                     .updateStateByKey(updateFunc, initialRDD=initialStateRDD)
    
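For completeness, a sketch of how the job could be wired up and run; updateStateByKey requires checkpointing to be enabled, and the checkpoint directory and the pprint() output below are only placeholders:

    # checkpointing is mandatory for updateStateByKey; the directory is a placeholder
    stream_context.checkpoint('/tmp/url_pair_checkpoint')

    # print the enriched JSON strings of each batch, then run the job
    pair_feed.map(lambda kv: kv[1]).pprint()

    stream_context.start()
    stream_context.awaitTermination()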

[Note: As far as I know, PySpark Structured Streaming is yet to get stateful streaming support, so I believe the above example still makes sense.]

Upvotes: 2
