Reputation: 1800
I need a Dataset<Double> of arbitrary size, filled with random or generated values.
It seems this can be done by implementing a custom RDD and generating the values inside its compute method (roughly as sketched below).
Is there a better solution?
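A minimal sketch of what I mean (RandomRDD, the partition count, and the per-partition sizing are just illustrations, not production code):

import scala.util.Random
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// each partition generates its share of values inside compute()
class RandomRDD(sc: SparkContext, size: Long, numPartitions: Int)
    extends RDD[Double](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    (0 until numPartitions)
      .map(i => new Partition { override def index: Int = i })
      .toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Double] = {
    val perPartition = (size / numPartitions).toInt
    Iterator.fill(perPartition)(Random.nextDouble())
  }
}

// then, with spark.implicits._ in scope:
// val ds = new RandomRDD(spark.sparkContext, 1000L, 4).toDS()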
Upvotes: 3
Views: 1598
Reputation: 2838
You can try Spark's built-in random data generation functions, rand and randn:
import org.apache.spark.sql.functions.{rand, randn}

val dfr = sqlContext.range(0, 10) // range can be what you want
val randomValues = dfr.select("id")
  .withColumn("uniform", rand(10L))  // uniform in [0, 1), seeded
  .withColumn("normal", randn(10L))  // standard normal, seeded
randomValues.show(truncate = false)
Output:
+---+-------------------+--------------------+
|id |uniform |normal |
+---+-------------------+--------------------+
|0 |0.41371264720975787|-0.5877482396744728 |
|1 |0.7311719281896606 |1.5746327759749246 |
|2 |0.1982919638208397 |-0.256535324205377 |
|3 |0.12714181165849525|-0.31703264334668824|
|4 |0.7604318153406678 |0.4977629425313746 |
|5 |0.12030715258495939|-0.506853671746243 |
|6 |0.12131363910425985|1.4250903895905769 |
|7 |0.44292918521277047|-0.1413699193557902 |
|8 |0.8898784253886249 |0.9657665088756656 |
|9 |0.03650707717266999|-0.5021009082343131 |
+---+-------------------+--------------------+
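Since the question asks for a Dataset[Double] specifically, you can take one of those columns and convert it; a small sketch, assuming the randomValues frame from above and the implicit encoders import:

import sqlContext.implicits._ // provides the Encoder[Double] that .as[Double] needs

val doubles = randomValues.select("uniform").as[Double] // Dataset[Double]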
Upvotes: 3
Reputation: 1023
Another way of doing it:
scala> val ds = spark.range(100)
ds: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val randDS = ds.withColumn("randomDouble", rand(100)).drop("id").as[Double]
randDS: org.apache.spark.sql.Dataset[Double] = [randomDouble: double]
scala> randDS.show
+--------------------+
| randomDouble|
+--------------------+
| 0.6841403791584381|
| 0.21180593775249568|
|0.020396922902442105|
| 0.3372830927732784|
| 0.967636350481069|
| 0.6420539234134518|
| 0.33027994655769854|
| 0.8027165538297113|
| 0.9938809031700999|
| 0.8346083871437393|
| 0.13512419677124388|
|0.061866246009553594|
| 0.5243597971107068|
| 0.38257478262291045|
| 0.6753627729921755|
| 0.9631590027671125|
| 0.14234112716353464|
| 0.38649575105988976|
| 0.7687994020915501|
| 0.8436272154312096|
+--------------------+
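Note that rand(100) is seeded, so the column comes out the same on every run; to get fresh values each time, drop the seed argument (a sketch in the same spark-shell session, where functions._ and spark.implicits._ are already in scope):

scala> val freshDS = spark.range(100).withColumn("randomDouble", rand()).drop("id").as[Double]
freshDS: org.apache.spark.sql.Dataset[Double] = [randomDouble: double]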
Upvotes: 0
Reputation: 6323
Not sure if it helps you, but have a look:
import org.apache.spark.sql.Encoders

val end = 100 // change this as required
val ds = spark.sql(s"select value from values (sequence(0, $end)) T(value)")
  .selectExpr("explode(value) as value") // one row per element of the sequence
  .selectExpr("(value * rand()) value")  // scale each index by a random factor
  .as(Encoders.DOUBLE)
ds.show(false)
ds.printSchema()
/**
* +-------------------+
* |value |
* +-------------------+
* |0.0 |
* |0.6598598027815629 |
* |0.34305452447822704|
* |0.2421654251914631 |
* |3.1937041196518896 |
* |0.9120972627613766 |
* |3.307431250924596 |
*
* root
* |-- value: double (nullable = false)
*/
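One caveat: sequence(0, $end) materializes the whole sequence as a single array in one row before the explode, so for very large sizes a range-based variant distributes the generation instead (a sketch, with the same Encoders import assumed; the size n is illustrative):

val n = 1000000L
val bigDs = spark.range(n)
  .selectExpr("rand() as value") // one random double per row, generated per partition
  .as(Encoders.DOUBLE)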
Upvotes: 0