Reputation: 123
I have three matrices (A, B and C) as individual RDDs, and I need to partition them among worker nodes as blocks. The action I perform needs to update the matrix blocks, but I need to synchronize on them so that two worker nodes don't update the same block at the same time. How can I achieve this synchronization? Is there a mechanism for locking? I am very new to Spark (PySpark).
Is it possible to control how partitioning is done by Spark, i.e. to control which block is sent to which worker node?
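For context, here is roughly what I have in mind (a minimal sketch; the block keys and shapes are just placeholders):

```python
# Key each block by its (block_row, block_col) index and use partitionBy
# with a custom partition function. As far as I understand, this controls
# which partition a block lands in, but Spark still decides which worker
# node actually runs that partition.
from pyspark import SparkContext

sc = SparkContext("local[4]", "block-partitioning")

# Hypothetical blocked matrix A: ((block_row, block_col), block) pairs
a_blocks = sc.parallelize([((i, j), [[0.0, 0.0], [0.0, 0.0]])
                           for i in range(2) for j in range(2)])

def block_partitioner(key):
    i, j = key
    return i * 2 + j  # deterministic block -> partition mapping

partitioned = a_blocks.partitionBy(4, block_partitioner)
print(partitioned.glom().map(len).collect())  # blocks per partition
```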
Please help.
Upvotes: 0
Views: 1031
Reputation: 330193
Technically it doesn't matter at all. There is no such thing as shared, mutable state in Spark (one could argue that this is the case with accumulators, but let's not dwell on that). That means there is no situation where a computation can modify shared state, so no kind of lock is required.
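For completeness, a minimal sketch of that accumulator caveat (the names here are illustrative): workers can only add to an accumulator, and only the driver can read its value.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "accumulator-demo")

# Accumulators are the closest thing to shared state: workers may add to
# them, but the value is only readable on the driver.
acc = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: acc.add(x))
print(acc.value)  # 10, visible only on the driver
```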
This is a little more complicated on the JVM, but the PySpark architecture provides complete isolation between workers, so unless you go outside Spark you're safe. If you do, it is your responsibility to handle conflicts using context-specific methods.
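To illustrate that isolation, a minimal sketch (the counter is mine, not from the question): each worker process operates on its own copy of the closure, so a mutation on a worker is never visible to the driver or to other workers.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "isolation-demo")

counter = 0

def increment(_):
    global counter
    counter += 1  # mutates the worker's private copy of the closure

sc.parallelize(range(100)).foreach(increment)
print(counter)  # still 0 on the driver
```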
Finally, if you try to modify data in place (please don't confuse data with RDDs), it is simply a programming mistake. It can lead to some really ugly things on the JVM but, once again, should have no visible effect in PySpark (this is just a matter of implementation, not a contract). Every change should be expressed using transformations and, unless specified otherwise (see for example the fold or aggregate family), shouldn't modify existing data.
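A minimal sketch of that transformation-based pattern (block keys and the update itself are illustrative): instead of locking and mutating a block in place, derive a new RDD of updated blocks.

```python
from pyspark import SparkContext

sc = SparkContext("local[4]", "immutable-update")

# ((block_row, block_col), block) pairs for a hypothetical blocked matrix
a_blocks = sc.parallelize([((i, j), [[1.0, 1.0], [1.0, 1.0]])
                           for i in range(2) for j in range(2)])

# "Update" every block by returning a new value; the source RDD is never
# mutated, so no two workers can race on the same block.
updated = a_blocks.mapValues(
    lambda block: [[2.0 * x for x in row] for row in block])

print(updated.first())
```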
Upvotes: 1