shakedzy

Reputation: 2893

Parallelizing independent actions on the same DataFrame in Spark

Let's say I have a Spark DataFrame with the following schema:

root
 |-- prob: Double
 |-- word: String

I'd like to randomly select two different words from this DataFrame, but I'd like to perform this action X times, so at the end I'll have X tuples of words selected at random, and of course every selection is independent of the others. How do I accomplish this?

EXAMPLE:

Let's say this is my data-set:

[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]

where the first number is prob and the second is the word. For X=5 the output could be:

1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red

As the selections are independent, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.
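
For reference, this data-set can be created as a DataFrame like so (a minimal sketch; spark is the SparkSession):

import spark.implicits._

val df = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))
  .toDF("prob", "word")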

Upvotes: 1

Views: 510

Answers (1)

Yehor Krivokon

Reputation: 877

1) You can use one of these DataFrame methods:

  • randomSplit(weights: Array[Double], seed: Long)
  • randomSplitAsList(weights: Array[Double], seed: Long) or
  • sample(withReplacement: Boolean, fraction: Double)

and then take the first two Rows.
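
For example, with sample (a sketch, assuming the DataFrame is called df; note that with a small fraction the sample may occasionally contain fewer than two rows):

// Sample roughly half the rows without replacement, then keep the first two.
val pair = df.sample(withReplacement = false, fraction = 0.5)
  .limit(2)
  .collect()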

2) Shuffle the rows and take the first two of them:

import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
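
To produce the X independent tuples the question asks for, this can simply be repeated X times; a minimal sketch (the DataFrame name dataset from above is kept, and X = 5 is an assumption):

val X = 5
val tuples = (1 to X).map { _ =>
  // Each iteration shuffles independently; limit(2) keeps two distinct rows.
  val Array(first, second) = dataset.orderBy(rand()).limit(2).collect()
    .map(_.getAs[String]("word"))
  (first, second)
}
tuples.foreach(println)

Each iteration triggers its own Spark job, so this is only practical when X is small.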

3) Or you can use the takeSample method of the RDD and then convert the result back to a DataFrame:

def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

val sampled = dataframe.rdd.takeSample(withReplacement = true, num = 1000)
spark.createDataFrame(spark.sparkContext.parallelize(sampled), dataframe.schema)
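
Note that takeSample returns a plain Array[Row], which is why the result is rebuilt with createDataFrame above. Applied to the question, sampling without replacement already gives distinct rows, so one pair can be drawn like this (a sketch; the seed is optional):

val Array(a, b) = dataframe.rdd
  .takeSample(withReplacement = false, num = 2, seed = 42L)
  .map(_.getAs[String]("word"))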

Upvotes: 1
