summary

Reputation: 125

How an application's work is distributed onto worker nodes in Apache Spark

I have an application that reads a file, does some calculations, and generates an output file on the driver machine. When I run it with one slave on machine A it takes 6 minutes. If I add another slave on machine B to the same cluster and run the driver program, it takes 13 minutes (with a few "no route to host" errors for machine B). I believe this is due to network latency; the time with 2 workers is always higher than with 1 worker. Still, it seems the application's work is not executing in a distributed manner: both slaves appear to do the whole job independently, each reading the input file in full, creating an RDD, and sending results to the driver for output. So where is the distributed computing that Apache Spark is known for? I also have a small word count program that only does computation, with no file I/O involved; when I run it on a huge file with multiple worker nodes, I see execution time decrease as workers are added. I want to know: does each worker read the full file and create its own RDD, with no distributed work happening in the program?
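For context, the way Spark's `sc.textFile` is meant to behave is that the input is divided into splits and each task reads only its own byte range; no worker should read the whole file, provided the input lives on a shared or distributed filesystem. Here is a minimal plain-Python sketch of that split-read idea (the `byte_splits` and `count_words_in_split` helpers are illustrative stand-ins, not Spark's actual code):

```python
import os
import tempfile
from collections import Counter

def byte_splits(path, n_splits):
    """Divide a file into byte ranges, snapped to line boundaries,
    so each 'worker' can read only its own portion."""
    size = os.path.getsize(path)
    boundaries = [i * size // n_splits for i in range(n_splits + 1)]
    starts = [0]
    with open(path, "rb") as f:
        for b in boundaries[1:-1]:
            f.seek(b)
            f.readline()  # advance past the next newline
            starts.append(max(f.tell(), starts[-1]))  # keep starts monotone
    starts.append(size)
    return [(starts[i], starts[i + 1])
            for i in range(n_splits) if starts[i] < starts[i + 1]]

def count_words_in_split(path, start, end):
    """Each task reads ONLY its own byte range -- the key point."""
    counts = Counter()
    with open(path, "rb") as f:
        f.seek(start)
        for line in f.read(end - start).splitlines():
            counts.update(line.decode().split())
    return counts

# Demo: partitioned counts must equal a single full-file count.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("spark makes work distributed\n" * 50 + "spark rdd\n" * 25)
    path = tmp.name

total = Counter()
for start, end in byte_splits(path, 4):
    total.update(count_words_in_split(path, start, end))

with open(path) as f:
    serial = Counter(f.read().split())
assert total == serial
os.unlink(path)
```

If the file is only on the driver's local disk, each worker cannot fetch its split and the whole setup degrades, which is consistent with the "no route to host" errors described above.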

Thanks much .

--edit Please find attached the screenshot with various worker nodes; the corresponding colored rectangles show the execution output. I am wondering why adding more workers increases the execution time. I sometimes see a "No route to host" exception in the log, but why does it not appear when I remove any one of the workers? Any pointers? -- Thanks in advance.

Upvotes: 1

Views: 1206

Answers (1)

Arnon Rotem-Gal-Oz

Reputation: 25909

You took a small dataset, put it on a file system that isn't distributed, and ran it through an engine designed to run on hundreds of nodes - what could go wrong?

Coordinating processes across many computers requires a lot of overhead: scheduling, sending data back and forth, serializing and deserializing, etc. If you can't solve the problem any other way, that overhead is acceptable, but if you run something small you are affected more by the overhead than by the time it takes to actually solve the problem.
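To make the overhead argument concrete, here is a toy cost model. All the constants are made up for illustration (they are not measured Spark numbers): each worker adds a fixed coordination/serialization cost, so for a small job one worker wins, while for a big job the parallel speedup dominates.

```python
def job_time(records, workers, per_record=0.001, per_worker_overhead=5.0):
    """Toy model: compute is split evenly across workers, but every
    worker adds a fixed scheduling/serialization/network cost.
    Constants are illustrative, not measured Spark numbers."""
    return records * per_record / workers + per_worker_overhead * workers

small, big = 8_000, 100_000_000

# Small job: the second worker's overhead outweighs the saved compute.
assert job_time(small, 1) < job_time(small, 2)

# Big job: splitting the compute across workers pays for the overhead.
assert job_time(big, 2) < job_time(big, 1)
```

This is the pattern in the question: the tiny file finishes faster with one worker, while the word count over a huge file speeds up as workers are added.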

Upvotes: 2

Related Questions