Reputation: 2893
Let's say I have a Spark DataFrame with the following schema:
root
| -- prob: Double
| -- word: String
I'd like to randomly select two different words from this DataFrame, and I'd like to do this X times, so that at the end I have X tuples of randomly selected words. Every selection is independent of the others. How do I accomplish this?
EXAMPLE:
Let's say this is my data-set:
[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]
where the first number is prob and the second is the word. For X=5 the output could be:
1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red
As they are independent actions, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.
Upvotes: 1
Views: 510
Reputation: 877
1) You can use one of these DataFrame methods:
randomSplit(weights: Array[Double], seed: Long)
randomSplitAsList(weights: Array[Double], seed: Long)
or sample(withReplacement: Boolean, fraction: Double)
and then take the first two rows.
2) Shuffle the rows and take the first two of them:
import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(2)
3) Or you can use the takeSample method of the RDD and then convert the result to a DataFrame:
def takeSample(
withReplacement: Boolean,
num: Int,
seed: Long = Utils.random.nextLong): Array[T]
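For example (takeSample returns an Array[Row], which has no toDF(), so the DataFrame is rebuilt with the original schema; spark is assumed to be the active SparkSession, and withReplacement = false keeps the two rows distinct):
val rows = dataframe.rdd.takeSample(withReplacement = false, num = 2)
spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), dataframe.schema)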
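To get the X independent pairs the question asks for, one option is to repeat a without-replacement draw of two rows X times. The following is only a minimal sketch, not a definitive implementation: the names RandomWordPairs and samplePairs are hypothetical, it samples uniformly (the prob column is ignored), and it assumes the DataFrame has at least two rows and that each word appears in only one row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RandomWordPairs {
  // Hypothetical helper: draws `x` independent pairs of words.
  // Each draw samples two rows WITHOUT replacement, so the same row
  // cannot be picked twice within a pair. Separate draws are independent,
  // so the same pair may show up more than once across the result.
  // Assumption: sampling is uniform; the `prob` column is not used.
  def samplePairs(df: DataFrame, x: Int): Seq[(String, String)] =
    (1 to x).map { _ =>
      val rows = df.select("word").rdd.takeSample(withReplacement = false, num = 2)
      (rows(0).getString(0), rows(1).getString(0))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("random-word-pairs")
      .getOrCreate()
    import spark.implicits._

    // The example data set from the question.
    val df = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))
      .toDF("prob", "word")

    samplePairs(df, 5).foreach { case (a, b) => println(s"$a, $b") }
    spark.stop()
  }
}
```

Each call to takeSample runs its own Spark job, so for a large X on a small DataFrame it may be cheaper to collect the words to the driver once and sample locally.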
Upvotes: 1