cantdutchthis


Create a multidimensional random matrix in Spark

With the Python API of Spark I am able to quickly create an RDD vector of uniform random numbers and perform a calculation with the following code:

from pyspark.mllib.random import RandomRDDs

# 1,000,000 uniform random values spread across 10 partitions
RandomRDDs.uniformRDD(sc, 1000000, 10).sum()

where sc is an available SparkContext. The upside of this approach is that it is very performant; the downside is that I am not able to create a random matrix this way.

You could use numpy instead, but this isn't performant.

%%time
import numpy as np
# Generate a 1000000 x 2 matrix locally, then distribute it
sc.parallelize(np.random.rand(1000000, 2)).sum()
array([ 499967.0714618 ,  499676.50123474])
CPU times: user 52.7 ms, sys: 31.1 ms, total: 83.9 ms
Wall time: 669 ms

For comparison with Spark:

%%time
# 2,000,000 values in total, matching the 1000000 x 2 numpy matrix above
RandomRDDs.uniformRDD(sc, 2000000, 10).sum()
999805.091403467
CPU times: user 4.54 ms, sys: 1.89 ms, total: 6.43 ms
Wall time: 183 ms

Is there a performant way to create random matrices/RDDs that contain more than one dimension with the Python Spark API?

Upvotes: 3

Views: 1372

Answers (1)

cantdutchthis


Spark has evolved a bit since this question was asked, and it will probably gain even better support in the future.
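In fact, depending on your version, some of that support may already be there: pyspark.mllib.random's RandomRDDs exposes uniformVectorRDD, which generates a distributed random matrix (an RDD of vectors) directly. A minimal sketch, assuming your Spark version ships it:

from pyspark.mllib.random import RandomRDDs

# A distributed 1000000 x 2 matrix of uniform random values, over 10 partitions
mat = RandomRDDs.uniformVectorRDD(sc, numRows=1000000, numCols=2, numPartitions=10)
mat.sum()  # element-wise column sums, comparable to the numpy example above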

In the meantime, or if that is not available on your version, you can be a bit creative with the .zip method of RDDs as well as DataFrames to get close to what numpy can do. It is a bit more verbose, but it works.

from pyspark.mllib.random import RandomRDDs
from pyspark.sql import Row

n = 100000

# Each zip pairs up two uniform RDDs, yielding an RDD of (x, y) tuples
p1 = RandomRDDs.uniformRDD(sc, n).zip(RandomRDDs.uniformRDD(sc, n))
p2 = RandomRDDs.uniformRDD(sc, n).zip(RandomRDDs.uniformRDD(sc, n))

# Zip the two pair RDDs and flatten each ((x1, y1), (x2, y2)) into a Row
point_rdd = p1.zip(p2)\
    .map(lambda r: Row(x1=r[0][0], y1=r[0][1], x2=r[1][0], y2=r[1][1]))
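Since point_rdd is an RDD of Rows, it can then be turned into a DataFrame for further work; a minimal sketch, assuming an available SQLContext bound to the (illustrative) name sqlContext:

# sqlContext is assumed to be an available SQLContext; the name is illustrative
df = sqlContext.createDataFrame(point_rdd)
df.show(5)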

Upvotes: 1
