Reputation: 20856
I use a Spark Streaming job to process my input requests.
My Spark input takes a filename, downloads the data, makes some changes, and sends the data downstream.
Currently, it takes 2 minutes to process a single file.
These file requests are independent operations and can be run in parallel.
Currently, when I send input through a netcat server, each request is processed before the next one starts. I want these requests to be processed in parallel.
import sys
import time

@timing  # user-defined timing decorator (definition not shown here)
def sleep_func(data):
    print("start file processing")
    time.sleep(60)  # stand-in for the ~2 minutes of per-file work
    print("end file processing")
    return data

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
processed = lines.map(sleep_func)
processed.pprint()  # pprint() returns None, so assigning its result is not useful
I'm trying to create multiple socketTextStream receivers that will be processed by each executor, based on this:
https://spark.apache.org/docs/2.0.2/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
streams = [ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) for _ in range(5)]  # 5 receivers
but I'm not sure how to process these individual streams separately.
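For what it's worth, here is a minimal self-contained sketch of the receiver-union approach from the linked guide: several socketTextStream receivers are created, unioned back into one DStream, and repartitioned so the expensive map runs as parallel tasks across executors. The app name, the 10-second batch interval, and the receiver count of 5 are assumptions, and sleep_func stands in for the real download/transform/send work.

import sys
import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="parallel-file-processing")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (assumed)

def sleep_func(data):
    print("start file processing")
    time.sleep(60)  # stand-in for the ~2 minutes of per-file work
    print("end file processing")
    return data

# Each receiver occupies one executor core, so the cluster needs more
# cores than receivers for any processing to happen.
num_receivers = 5
streams = [ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
           for _ in range(num_receivers)]

# Union the per-receiver DStreams into one, then repartition so the
# per-file map is scheduled as several parallel tasks.
unioned = ssc.union(*streams)
unioned.repartition(num_receivers).map(sleep_func).pprint()

ssc.start()
ssc.awaitTermination()

Whether this actually overlaps the 2-minute files depends on how many executor cores are available and on the netcat server accepting several connections on the same port.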
Upvotes: 1
Views: 1263
Reputation: 4719
You mean you want to process batches of data in parallel instead of one by one, right?
see: How jobs are assigned to executors in Spark Streaming?
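If the problem is that whole batches queue up behind one another, one knob often mentioned alongside that question is spark.streaming.concurrentJobs, which is undocumented and experimental. A minimal sketch of setting it is below; the value of 4 and the app name are assumptions, and the rest of the pipeline would stay as in the question.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("concurrent-batches")
# Undocumented/experimental: lets jobs from different batches run
# concurrently instead of strictly one after another.
conf.set("spark.streaming.concurrentJobs", "4")

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batch interval (assumed)
# ... build the socketTextStream / map / pprint pipeline as in the question,
# then call ssc.start() and ssc.awaitTermination()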
Upvotes: 0