Reputation: 353
I have a program that generates a DataFrame, on which it will then run something like
select(Col1, Col2, ...).orderBy(ColX).limit(N)
However, when I collect the data at the end, I find that it causes the driver to OOM if I take a large enough top N.
Another observation is that if I do just the sort or just the top, this problem does not happen. It happens only when sort and top are used at the same time.
I am wondering why this could be happening. In particular, what is really going on underneath this combination of the two transformations? How does Spark evaluate a query with both sorting and a limit, and what is the corresponding execution plan?
Also, just curious: does Spark handle sort and top differently between DataFrames and RDDs?
EDIT: Sorry, I didn't mean collect. What I originally meant is that this happens when I call any action to materialize the data, whether or not it is collect (or any other action sending data back to the driver). So the problem is definitely not the output size.
Upvotes: 4
Views: 6507
Reputation: 330093
While it is not clear why this fails in this particular case, there are multiple issues you may encounter:

- limit simply puts all data on a single partition, no matter how big n is. So while it doesn't explicitly collect, it is almost as bad (see the sketch after this list).
- orderBy requires a full shuffle with range partitioning, which can result in different issues when the data distribution is skewed.
- collect results can be larger than the amount of memory available on the driver.
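For illustration, a minimal sketch (not from the original answer; df, ColX, and n are toy stand-ins) of how to inspect what Spark actually plans for sort + limit, and to confirm the single-partition behavior of limit:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("sort-limit-plan").getOrCreate()
    import spark.implicits._

    // Toy stand-ins for the question's DataFrame and top-N size.
    val df = spark.range(0L, 1000000L).toDF("ColX")
    val n = 100000

    // Print the physical plan Spark chooses for the sort + limit combination.
    df.orderBy($"ColX").limit(n).explain()

    // limit on its own funnels the data into one partition, which is easy to check:
    println(df.limit(n).rdd.getNumPartitions) // typically 1, depending on Spark version

explain() prints the physical plan, so you can see exactly which operators your Spark version picks for this query.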
If you collect anyway, there is not much you can improve here. At the end of the day driver memory will be a limiting factor, but there are still some possible improvements:

- Drop limit.
- Replace collect with toLocalIterator (see the sketch after this list).
- Use orderBy |> rdd |> zipWithIndex |> filter, or, if the exact number of values is not a hard requirement, filter the data directly based on an approximated distribution, as shown in Saving a spark dataframe in multiple parts without repartitioning (in Spark 2.0.0+ there is the handy approxQuantile method); both approaches are sketched below.
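A rough sketch of those suggestions, reusing the hypothetical df, ColX, and n from the snippet above (an illustration under those assumptions, not the answer's exact code):

    import scala.collection.JavaConverters._

    // Stream sorted rows to the driver one partition at a time instead of
    // collecting everything at once; take only the first n on the driver side.
    df.orderBy($"ColX").toLocalIterator().asScala.take(n).foreach(println)

    // Exact top n without limit: orderBy |> rdd |> zipWithIndex |> filter.
    // The data stays distributed; only rows with index < n are kept.
    val topN = df.orderBy($"ColX")
      .rdd
      .zipWithIndex()
      .filter { case (_, idx) => idx < n }
      .keys

    // Approximate alternative (Spark 2.0.0+): estimate a cutoff value with
    // approxQuantile and filter directly, when an exact n is not required.
    val Array(cutoff) = df.stat.approxQuantile("ColX", Array(n.toDouble / df.count()), 0.01)
    val approxTopN = df.filter($"ColX" <= cutoff)

toLocalIterator only needs one partition's worth of rows on the driver at a time, while the zipWithIndex variant produces an exact top n without funneling everything through a single partition.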
Upvotes: 6