user9941064

How to create a large Spark DataFrame with random content using Scala?

I need to create a large Spark DataFrame with 1000+ columns, 10M+ rows, and 1000 partitions, filled with random data for testing. I know I need to create a large RDD and apply a schema to it with spark.sqlContext.createDataFrame(rdd, schema). So far I have created the schema:

val schema = StructType((0 to 1000).map(n => StructField(s"column_$n", IntegerType)))

I'm stuck on generating a large RDD with random content. How do I do it?
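For context, the pattern I'm aiming for looks roughly like this (the tiny hard-coded RDD is just a placeholder; producing the large random one is the part I'm missing):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType((0 to 1000).map(n => StructField(s"column_$n", IntegerType)))
// placeholder RDD[Row]: each row must supply one value per schema field
val rdd = sc.parallelize(Seq(Row.fromSeq(Seq.fill(schema.length)(0))))
spark.sqlContext.createDataFrame(rdd, schema)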

Upvotes: 3

Views: 812

Answers (1)

user9941064

Got it working using RandomRDDs from the mllib package:

import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType((0 to 2000).map(n => StructField(s"column_$n", IntegerType)))
// derive an Int from each random Double so the values match the IntegerType schema
val rows = normalRDD(sc, 1000000L, 10).map(m => Row(schema.map(_ => (m * 1000).toInt).toList: _*))
spark.sqlContext.createDataFrame(rows, schema)
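Note that with normalRDD every column of a given row ends up with the same value, since the whole row is derived from one random Double. If each column should get its own random value, normalVectorRDD from the same package produces one random vector per row; a minimal sketch (the column count, row count, and partition figures just mirror the question, and scaling the Doubles into Ints is an arbitrary choice to fit the IntegerType schema):

import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val numCols = 1000
val schema = StructType((0 until numCols).map(n => StructField(s"column_$n", IntegerType)))
// one random vector per row across 1000 partitions; each Double becomes its own Int column
val rows = normalVectorRDD(sc, 10000000L, numCols, 1000).map(v => Row.fromSeq(v.toArray.map(d => (d * 1000).toInt)))
val df = spark.sqlContext.createDataFrame(rows, schema)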

Upvotes: 2
