randomly initialized dataframe in spark

Question

I need to create a dataframe with n rows, and each column value of a row initialized with 0/1 randomly. An example dataframe would be:

+----+----+----+
| id | c1 | c2 |
+----+----+----+
|  1 |  0 |  1 |
|  2 |  1 |  1 |
|  3 |  1 |  0 |
+----+----+----+

Currently I am using following procedure:

create empty dataframe
generate individual sequence
append to existing dataframe using union()

The code is as follows:

for (k <- 0 until n) { 
  var newRow = k+:Seq.fill(N)(Random.nextInt(2)) // random fill with 0/1 and appending id
  X = X.union(newRow.toDF())
}

Does above method hurt performance(running time)? Is there any better way to do this?

Andronicus · Accepted Answer

There is an implicit method that creates DataFrame from Iterable in scala, you can make use of that provided, that it consists of tuples. The following code:

val a = (for (_ <- 0 until 5) yield Seq.fill(3)(Random.nextInt(2)))
    .map(x => (x(0), x(1), x(2)))
import spark.implicits._
a.toDF.show

Gives the following result:

+---+---+---+
| _1| _2| _3|
+---+---+---+
|  0|  1|  1|
|  1|  0|  0|
|  0|  0|  0|
|  0|  1|  0|
|  1|  1|  1|
+---+---+---+

You can provide a schema / rename the columns properly. More information on why those innner-structures have to be tuples can be found in this answer.

randomly initialized dataframe in spark

Answers (2)

Related Questions