Reputation: 1329
I need to solve something similar to the equation Ax=b, in which A is an extremely large (but sparse) matrix. (It's 97% sparse; but its size is approximately 1,000,000,000 x 33,000,000)
Without getting bogged down into unnecessary details, the solution is effectively an iterative solution in which one “guesses X”, and then one plugs back the guessed “X” Back into the equation, evaluates it, and then re-determines the “next version of X”. The matrix A and the vector “b” don't change in these iterations. (There may be anywhere from 1,000 to 100,000 iterations).
My concern is how to write this iterative logic in such a manner that we don't keep copying the matrix A ( and even the vector B) in every iteration. (In theory, even vector “x” should not need to be re-instantiated in every iteration, but that cost is pretty minor.) If this were a single threaded application, it's quite easy to implement the logic. But I am unsure how to do this efficiently in the dataflow architecture. ( I instinctively feel that perhaps “side inputs” may be helpful here; or, instead of passing portions of the matrix A back-and-forth and recopy the data, maybe one just passes a pointer to the object?)
Upvotes: 1
Views: 188
Reputation: 3010
In general, you will end up copying data since different workers will need different parts of the overall data at different times. If you can factor your overall problem into small enough pieces that each can be iterated on in-memory, you may be able to reduce the communication cost. Additionally, if it's possible to express each iteration as just a delta update such that the delta is even sparser than the data, you may also be able to improve things.
Upvotes: 1