jnaour

Reputation: 299

groupByKey with millions of rows per key

Context:

We tried:

Changing the memory allotted to the executor/driver worked, but only for 10k or 100k rows per key. What about millions of rows per key, which could happen in the future?
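For reference, a minimal sketch of what that memory tweak looks like; the app name and the 8g value are illustrative, not taken from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: raising executor memory via SparkConf.
// Driver memory usually has to be set before the driver JVM starts,
// e.g. with spark-submit's --driver-memory flag.
val conf = new SparkConf()
  .setAppName("groupByKey-memory-sketch")   // hypothetical app name
  .set("spark.executor.memory", "8g")       // illustrative value

val sc = new SparkContext(conf)
```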

It seems that there is some work on this kind of issue: https://github.com/apache/spark/pull/1977

But it is specific to PySpark, not the Scala API that we currently use.

My questions are:

Upvotes: 0

Views: 524

Answers (1)

Sean Owen

Reputation: 66886

I think the change in question just makes PySpark work more like the main API. You probably don't want to design a workflow that requires a huge number of values per key, no matter what. There isn't a fix other than designing it differently.

I haven't tried this, and am only fairly sure this behavior is guaranteed, but maybe you can sortBy timestamp on the whole data set and then foldByKey. You provide a function that merges a previous value into the next value. This should encounter the data in timestamp order, so you see rows t and t+1 each time, and can just return row t+1 after augmenting it however you like.
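A minimal Scala sketch of that idea, assuming a hypothetical Event(key, timestamp, value) record and a simple sum as the merge function; as noted above, Spark's contract does not strictly guarantee that foldByKey will see the values in the sorted order, so treat this as a sketch rather than a verified recipe:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type; the real schema comes from your data.
case class Event(key: String, timestamp: Long, value: Double)

object SortThenFold {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sort-then-fold").setMaster("local[*]"))

    // Illustrative data; in practice this would be read from storage.
    val events = sc.parallelize(Seq(
      Event("a", 1L, 1.0), Event("a", 2L, 2.0), Event("b", 1L, 5.0)
    ))

    val folded = events
      .sortBy(_.timestamp)            // sort the whole data set by timestamp
      .map(e => (e.key, e.value))     // key the records
      .foldByKey(0.0)(_ + _)          // merge each "previous" value into the "next"; here just a sum

    folded.collect().foreach(println)
    sc.stop()
  }
}
```

The point of the fold is that it never materializes all values of a key at once, unlike groupByKey, so memory use per key stays bounded by the accumulated value rather than by the number of rows.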

Upvotes: 1
