Sasank Annavarapu

Reputation: 106

Randomly initialized DataFrame in Spark

I need to create a DataFrame with n rows, where each column value of a row is randomly initialized to 0 or 1. An example DataFrame would be:

+----+----+----+
| id | c1 | c2 |
+----+----+----+
|  1 |  0 |  1 |
|  2 |  1 |  1 |
|  3 |  1 |  0 |
+----+----+----+

Currently I am using the following procedure:

import org.apache.spark.sql.Row
import scala.util.Random

for (k <- 0 until n) {
  // one row: the id followed by N random 0/1 values
  val newRow = Row.fromSeq(k +: Seq.fill(N)(Random.nextInt(2)))
  X = X.union(spark.createDataFrame(
    spark.sparkContext.parallelize(Seq(newRow)), X.schema))
}

Does the above method hurt performance (running time)? Is there any better way to do this?

Upvotes: 2

Views: 236

Answers (2)

user11227113

Reputation: 11

Does the above method hurt performance (running time)?

In quite a few ways, but primarily because of the growing lineage and execution plan. Additionally, calling toDF on a local sequence keeps all of the data in the driver's memory.

In other words - it doesn't scale at all.

Is there any better way to do this?

Of course there is:

import org.apache.spark.sql.functions.rand
import spark.implicits._

spark.range(n).select(
  ($"id" + 1).as("id"),
  (rand() > 0.5).cast("integer").as("c1"),
  (rand() > 0.5).cast("integer").as("c2"))
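The two-column select above can be generalized to any number of random columns by building the expressions in a loop. A minimal sketch, where `numCols` is a hypothetical parameter (not from the original answer) and the final two lines are shown as comments because they need a live SparkSession with `spark.implicits._` in scope:

```scala
// hypothetical parameter: how many random 0/1 columns to generate
val numCols = 4
val names = (1 to numCols).map(i => s"c$i")

// each name becomes a (rand() > 0.5).cast("integer") expression:
// val cols = names.map(c => (rand() > 0.5).cast("integer").as(c))
// spark.range(n).select(($"id" + 1).as("id") +: cols: _*)
```

Because rand() is evaluated per row and per column, each expression produces an independent 0/1 value, and the whole job stays distributed instead of looping on the driver.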

Upvotes: 0

Andronicus

Reputation: 26046

There is an implicit method in Scala that creates a DataFrame from an Iterable; you can make use of it, provided the Iterable consists of tuples. The following code:

import spark.implicits._
import scala.util.Random

// five rows, each a tuple of three random 0/1 values
val a = (0 until 5)
  .map(_ => Seq.fill(3)(Random.nextInt(2)))
  .map(x => (x(0), x(1), x(2)))
a.toDF.show

Gives the following result:

+---+---+---+
| _1| _2| _3|
+---+---+---+
|  0|  1|  1|
|  1|  0|  0|
|  0|  0|  0|
|  0|  1|  0|
|  1|  1|  1|
+---+---+---+

You can provide a schema or rename the columns properly. More information on why those inner structures have to be tuples can be found in this answer.
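If the default `_1`/`_2`/`_3` headers are not wanted, `toDF` also accepts explicit column names, which matches the `id`/`c1`/`c2` layout from the question. A minimal sketch of that renaming, with the id generated locally alongside the values (the final call is commented out because it needs a live SparkSession with `spark.implicits._` in scope):

```scala
import scala.util.Random

// rows built locally as tuples: id plus two random 0/1 values
val rows = (1 to 5).map(k => (k, Random.nextInt(2), Random.nextInt(2)))

// toDF takes column names, replacing the default _1/_2/_3 headers:
// rows.toDF("id", "c1", "c2").show()
```

Note that, like the original question's loop, this builds all rows on the driver, so it only suits small n; for large n the rand()-based approach in the other answer scales better.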

Upvotes: 3
