lte__

Reputation: 7576

Spark DataFrame - Select n random rows

I have a dataframe with multiple thousands of records, and I'd like to randomly select 1000 rows into another dataframe for demoing. How can I do this in Java?

Thank you!

Upvotes: 40

Views: 90628

Answers (5)

Cassio Alan Garcia

Reputation: 31

The way I found in Python to take an exact number of rows is:

sampleData = spark.createDataFrame(originalData.rdd.takeSample(withReplacement=False, num=1000, seed=42))

Upvotes: 2

apatry

Reputation: 807

In Scala, you can shuffle the rows and then take the top ones:

import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n)
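
To see why this returns exactly n rows with no duplicates, here is a plain-Python sketch of the same idea outside Spark. It is only an illustration: `shuffle_and_take` is a hypothetical helper name, and an ordinary list stands in for the dataset.

```python
import random


def shuffle_and_take(rows, n, seed=None):
    """Attach a random key to every row, sort by that key, keep the first n."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort(key=lambda pair: pair[0])   # plays the role of orderBy(rand())
    return [row for _, row in keyed[:n]]   # plays the role of limit(n)


rows = list(range(5000))
picked = shuffle_and_take(rows, 1000, seed=7)
print(len(picked))  # 1000
```

Every row receives a random key, so the whole input is read once, and asking for more rows than exist simply returns all of them.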

Upvotes: 59

s510

Reputation: 2822

In PySpark >= 3.1, try this:

sdf.sample(fraction=1.0).limit(n)

Upvotes: 7

dheeraj .A

Reputation: 1117

I would prefer this in PySpark:

df.sample(withReplacement=False, fraction=desired_fraction)

See the documentation for DataFrame.sample.

Upvotes: 2

T. Gawęda

Reputation: 16076

You can try the sample() method. Unfortunately, it takes a fraction, not a number of rows. You can write a function like this:

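import org.apache.spark.sql.Dataset

// Returns roughly n random rows, and never more than n.
def getRandom(dataset: Dataset[_], n: Int) = {
    val count = dataset.count()
    val howManyTake = if (count > n) n else count
    // withReplacement must be a Boolean; the second argument is the fraction of rows to keep
    dataset.sample(false, 1.0 * howManyTake / count).limit(n)
}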

Explanation: sample() takes a fraction of the data. If we have 2000 rows and want 100 of them, we need 0.05 of the total rows. If you ask for more rows than the DataFrame contains, the fraction is 1.0. limit() is called afterwards because the fraction only sets the expected number of rows: it guarantees that you never get more rows than you asked for, although the result may still contain slightly fewer.
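
The fraction arithmetic can be checked without a cluster. The following is a minimal plain-Python sketch, not Spark code: `sampling_fraction` and `bernoulli_sample` are hypothetical helper names, and the per-row coin flip is only an assumed stand-in for how a fraction-based sample behaves. It illustrates that the fraction yields an approximate row count, which is why a final limit is applied.

```python
import random


def sampling_fraction(total_rows: int, n: int) -> float:
    """Fraction of rows to keep so that about n rows are expected."""
    if total_rows <= 0:
        return 0.0  # avoid dividing by zero on an empty input
    return min(n, total_rows) / total_rows


def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(2000))
fraction = sampling_fraction(len(rows), 100)  # 100 / 2000
sampled = bernoulli_sample(rows, fraction, seed=42)
limited = sampled[:100]                       # plays the role of limit(n)

print(fraction)                    # 0.05
print(len(sampled), len(limited))  # close to 100, and at most 100 after the cut
```

If exactly n rows are required, one option is to sample with a somewhat larger fraction and let the limit trim the excess.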

Edit: I see the takeSample method in another answer. But remember:

  1. It's a method of RDD, not Dataset, so you must call dataset.rdd.takeSample(false, 1000, System.currentTimeMillis()). takeSample collects the sampled values to the driver as a local array, so you then have to build a DataFrame from it again (for example with spark.createDataFrame).
  2. If you want a very large number of rows, you may run into an OutOfMemoryError, because takeSample collects its results on the driver. Use it carefully.

Upvotes: 17
