jnaour

Reputation: 299

groupByKey with millions of rows per key

Context:

We tried:

Changing the memory allotted to the executor/driver worked, but only for 10k or 100k rows per key. What about millions of rows per key, which could happen in the future?
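For reference, a minimal sketch of what that memory tweak looks like; the app name and the 8g value are illustrative, not taken from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: raising executor memory via SparkConf.
// Driver memory usually has to be set before the driver JVM starts,
// e.g. with spark-submit's --driver-memory flag.
val conf = new SparkConf()
  .setAppName("groupByKey-memory-sketch")   // hypothetical app name
  .set("spark.executor.memory", "8g")       // illustrative value

val sc = new SparkContext(conf)
```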

It seems that there is some work on this kind of issue: https://github.com/apache/spark/pull/1977

But it is specific to PySpark, not the Scala API that we currently use.

My questions are:

Upvotes: 0

Views: 524

Answers (1)

Sean Owen

Reputation: 66886

I think the change in question just makes PySpark work more like the main API. You probably don't want to design a workflow that requires a huge number of values per key, no matter what. There isn't a fix other than designing it differently.

I haven't tried this, and am only fairly sure this behavior is guaranteed, but maybe you can sortBy timestamp on the whole data set and then foldByKey. You provide a function that merges a previous value into the next value. This should encounter the data in timestamp order, so you see rows t and t+1 each time, and can just return row t+1 after augmenting it however you like.
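A minimal Scala sketch of that idea, assuming a hypothetical Event(key, timestamp, value) record and a simple sum as the merge function; as noted above, Spark's contract does not strictly guarantee that foldByKey will see the values in the sorted order, so treat this as a sketch rather than a verified recipe:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type; the real schema comes from your data.
case class Event(key: String, timestamp: Long, value: Double)

object SortThenFold {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sort-then-fold").setMaster("local[*]"))

    // Illustrative data; in practice this would be read from storage.
    val events = sc.parallelize(Seq(
      Event("a", 1L, 1.0), Event("a", 2L, 2.0), Event("b", 1L, 5.0)
    ))

    val folded = events
      .sortBy(_.timestamp)            // sort the whole data set by timestamp
      .map(e => (e.key, e.value))     // key the records
      .foldByKey(0.0)(_ + _)          // merge each "previous" value into the "next"; here just a sum

    folded.collect().foreach(println)
    sc.stop()
  }
}
```

The point of the fold is that it never materializes all values of a key at once, unlike groupByKey, so memory use per key stays bounded by the accumulated value rather than by the number of rows.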

Upvotes: 1
