shakedzy

Reputation: 2893

Parallelizing independent actions on the same DataFrame in Spark

Let's say I have a Spark DataFrame with the following schema:

root
 |-- prob: Double
 |-- word: String

I'd like to randomly select two different words from this DataFrame, but I'd like to perform this action X times, so at the end I'll have X tuples of words selected at random, and of course every selection is independent of the others. How do I accomplish this?

EXAMPLE:

Let's say this is my data-set:

[(0.1,"blue"),(0.2,"yellow"),(0.1,"red"),(0.6,"green")]

where the first number is prob and the second is the word. For X=5 the output could be:

1. blue, green
2. green, yellow
3. green, yellow
4. yellow, blue
5. green, red

As the selections are independent, you can see that 2 and 3 are the same, and that's fine. But within each tuple, a word can appear only once.
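
For reference, this data-set can be created as a DataFrame like so (a minimal sketch; spark is the SparkSession):

import spark.implicits._

val df = Seq((0.1, "blue"), (0.2, "yellow"), (0.1, "red"), (0.6, "green"))
  .toDF("prob", "word")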

Upvotes: 1

Views: 510

Answers (1)

Yehor Krivokon

Reputation: 877

1) You can use one of these DataFrame methods:

  • randomSplit(weights: Array[Double], seed: Long)
  • randomSplitAsList(weights: Array[Double], seed: Long) or
  • sample(withReplacement: Boolean, fraction: Double)

and then take the first two Rows.
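
For example, with sample (a sketch, assuming the DataFrame is called df; note that with a small fraction the sample may occasionally contain fewer than two rows):

// Sample roughly half the rows without replacement, then keep the first two.
val pair = df.sample(withReplacement = false, fraction = 0.5)
  .limit(2)
  .collect()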

2) Shuffle the rows and take the first two of them:

import org.apache.spark.sql.functions.rand
dataset.orderBy(rand()).limit(n)
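
To produce the X independent tuples the question asks for, this can simply be repeated X times; a minimal sketch (the DataFrame name dataset from above is kept, and X = 5 is an assumption):

val X = 5
val tuples = (1 to X).map { _ =>
  // Each iteration shuffles independently; limit(2) keeps two distinct rows.
  val Array(first, second) = dataset.orderBy(rand()).limit(2).collect()
    .map(_.getAs[String]("word"))
  (first, second)
}
tuples.foreach(println)

Each iteration triggers its own Spark job, so this is only practical when X is small.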

3) Or you can use the takeSample method of the RDD and then convert the result back to a DataFrame:

def takeSample(
      withReplacement: Boolean,
      num: Int,
      seed: Long = Utils.random.nextLong): Array[T]

For example:

val sampled = dataframe.rdd.takeSample(withReplacement = true, num = 1000)
spark.createDataFrame(spark.sparkContext.parallelize(sampled), dataframe.schema)
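
Note that takeSample returns a plain Array[Row], which is why the result is rebuilt with createDataFrame above. Applied to the question, sampling without replacement already gives distinct rows, so one pair can be drawn like this (a sketch; the seed is optional):

val Array(a, b) = dataframe.rdd
  .takeSample(withReplacement = false, num = 2, seed = 42L)
  .map(_.getAs[String]("word"))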

Upvotes: 1
