Reputation: 20856
I use a Spark Streaming job to process my input requests.
My Spark input takes a filename, downloads the data, makes some changes, and sends the data downstream.
Currently, it takes 2 minutes to process a single file.
These file requests are independent operations and can be run in parallel.
Currently, when I send input through a netcat server, each request is processed before the next one starts. I want these requests to be processed in parallel.
import sys
import time

@timing  # user-defined timing decorator (definition not shown here)
def sleep_func(data):
    print("start file processing")
    time.sleep(60)  # stand-in for the ~2 minutes of per-file work
    print("end file processing")
    return data

lines = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
processed = lines.map(sleep_func)
processed.pprint()  # pprint() returns None, so assigning its result is not useful
I'm trying to create multiple socketTextStream receivers that will be processed by each executor, based on this:
https://spark.apache.org/docs/2.0.2/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
streams = [ssc.socketTextStream(sys.argv[1], int(sys.argv[2])) for _ in range(5)]  # 5 receivers
but I'm not sure how to process these individual streams separately.
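For what it's worth, here is a minimal self-contained sketch of the receiver-union approach from the linked guide: several socketTextStream receivers are created, unioned back into one DStream, and repartitioned so the expensive map runs as parallel tasks across executors. The app name, the 10-second batch interval, and the receiver count of 5 are assumptions, and sleep_func stands in for the real download/transform/send work.

import sys
import time

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="parallel-file-processing")
ssc = StreamingContext(sc, 10)  # 10-second batch interval (assumed)

def sleep_func(data):
    print("start file processing")
    time.sleep(60)  # stand-in for the ~2 minutes of per-file work
    print("end file processing")
    return data

# Each receiver occupies one executor core, so the cluster needs more
# cores than receivers for any processing to happen.
num_receivers = 5
streams = [ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
           for _ in range(num_receivers)]

# Union the per-receiver DStreams into one, then repartition so the
# per-file map is scheduled as several parallel tasks.
unioned = ssc.union(*streams)
unioned.repartition(num_receivers).map(sleep_func).pprint()

ssc.start()
ssc.awaitTermination()

Whether this actually overlaps the 2-minute files depends on how many executor cores are available and on the netcat server accepting several connections on the same port.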
Upvotes: 1
Views: 1263
Reputation: 4719
You mean you want to process batches of data in parallel instead of one by one, right?
see: How jobs are assigned to executors in Spark Streaming?
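If the problem is that whole batches queue up behind one another, one knob often mentioned alongside that question is spark.streaming.concurrentJobs, which is undocumented and experimental. A minimal sketch of setting it is below; the value of 4 and the app name are assumptions, and the rest of the pipeline would stay as in the question.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("concurrent-batches")
# Undocumented/experimental: lets jobs from different batches run
# concurrently instead of strictly one after another.
conf.set("spark.streaming.concurrentJobs", "4")

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batch interval (assumed)
# ... build the socketTextStream / map / pprint pipeline as in the question,
# then call ssc.start() and ssc.awaitTermination()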
Upvotes: 0