Reputation: 299
Context:
We tried:
Increasing the memory allocated to the executor/driver worked, but only for 10k or 100k rows per key. What about millions of rows per key, which could happen in the future?
It seems there is some work on this kind of issue: https://github.com/apache/spark/pull/1977
But it's specific to PySpark and not to the Scala API that we currently use.
My questions are:
Upvotes: 0
Views: 524
Reputation: 66886
I think the change in question just makes PySpark work more like the main API. You probably don't want to design a workflow that requires a huge number of values per key, no matter what. There isn't a fix other than designing it differently.
I haven't tried this, and am only fairly sure this behavior is guaranteed, but maybe you can sortBy timestamp on the whole data set, and then foldByKey. You provide a function that merges a previous value into the next value. This should encounter the data in timestamp order, so you see rows t and t+1 each time, and each time you can just return row t+1 after augmenting it however you like.
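For concreteness, here's a minimal sketch of what that could look like with the RDD API. The Event type, its field names, and the sum-based merge are made up for illustration, and as I said above, the encounter order inside foldByKey is something I'd verify before relying on it:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FoldByKeySketch {
  // Hypothetical record shape: a key, a timestamp, and some value to accumulate.
  case class Event(key: String, ts: Long, value: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("foldByKey-sketch").setMaster("local[*]"))

    val events = sc.parallelize(Seq(
      Event("a", 1L, 1.0), Event("a", 2L, 2.0), Event("b", 1L, 5.0)
    ))

    // Sort the whole data set by timestamp first...
    val sorted = events.sortBy(_.ts)

    // ...then fold per key, merging the "previous" accumulated event into the next one.
    // The running sum here is just a stand-in for whatever augmentation you need;
    // the zero value must be neutral, since Spark may apply it once per partition.
    val folded = sorted
      .map(e => (e.key, e))
      .foldByKey(Event("", 0L, 0.0)) { (prev, next) =>
        next.copy(value = prev.value + next.value)
      }

    folded.collect().foreach(println)
    sc.stop()
  }
}
```

Unlike groupByKey, this never materializes all values of a key in memory at once, which is the point if you expect millions of rows per key.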
Upvotes: 1