Reputation: 106
I need to create a dataframe with n rows, and each column value of a row initialized with 0/1 randomly. An example dataframe would be:
+----+----+----+
| id | c1 | c2 |
+----+----+----+
| 1 | 0 | 1 |
| 2 | 1 | 1 |
| 3 | 1 | 0 |
+----+----+----+
Currently I am using following procedure:
The code is as follows:
for (k <- 0 until n) {
  val newRow = k +: Seq.fill(N)(Random.nextInt(2)) // random 0/1 values, with the id prepended
  X = X.union(newRow.toDF())
}
Does the above method hurt performance (running time)? Is there a better way to do this?
Upvotes: 2
Views: 236
Reputation: 11
Does the above method hurt performance (running time)?
In quite a few ways, but primarily as a result of the growing lineage and execution plan. Additionally, calling toDF
on a local sequence keeps all the data in the driver's memory.
In other words, it doesn't scale at all.
Is there any better way to do this?
Of course there is:
import org.apache.spark.sql.functions.rand
import spark.implicits._ // needed for the $"..." column syntax

spark.range(n).select(
  $"id" + 1 as "id",
  (rand() > 0.5) cast "integer" as "c1",
  (rand() > 0.5) cast "integer" as "c2")
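Since the question asks for an arbitrary number of columns rather than just two, the same idea can be generalized by building the column expressions programmatically. A minimal sketch, assuming an existing SparkSession `spark`, a row count `n`, and a column count `N` (both placeholders from the question):

```scala
import org.apache.spark.sql.functions.rand
import spark.implicits._

// Build one random 0/1 column expression per requested column.
val randomCols = (1 to N).map(i => (rand() > 0.5).cast("integer").as(s"c$i"))

// Prepend the 1-based id column and select everything in one distributed plan.
val df = spark.range(n).select(($"id" + 1).as("id") +: randomCols: _*)
```

This keeps everything as a single execution plan, so the lineage stays flat no matter how many rows or columns are requested.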
Upvotes: 0
Reputation: 26046
There is an implicit method in Scala that creates a DataFrame
from an Iterable;
you can make use of it, provided that the Iterable consists of tuples. The following code:
import scala.util.Random
import spark.implicits._

val a = (for (_ <- 0 until 5) yield Seq.fill(3)(Random.nextInt(2)))
  .map(x => (x(0), x(1), x(2)))
a.toDF.show
Gives the following result:
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 0| 1| 1|
| 1| 0| 0|
| 0| 0| 0|
| 0| 1| 0|
| 1| 1| 1|
+---+---+---+
You can provide a schema / rename the columns properly. More information on why those inner structures have to be tuples can be found in this answer.
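For example, the varargs overload of toDF accepts column names directly. A small sketch, assuming the same implicits are in scope (the id column and names here just mirror the question's desired output):

```scala
import scala.util.Random
import spark.implicits._

// Same tuple-based construction, but with an explicit id and named columns.
val rows = (1 to 3).map(i => (i, Random.nextInt(2), Random.nextInt(2)))
rows.toDF("id", "c1", "c2").show()
```

Note this still builds the data on the driver, so it is only suitable for small n; for large row counts, the spark.range-based approach in the other answer scales better.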
Upvotes: 3