While struggling with Is Spark's KMeans unable to handle bigdata?, I want to create a minimal example that demonstrates the problem. To do that, I want to generate the data rather than read it from somewhere.
Here is what my data looks like:
In [22]: realData.take(2)
Out[22]:
[array([ 84.35778894, 190.61634731, 121.61911155, -42.2862512 ,
-39.33345881, 56.73534546, -15.59698061, -86.12075349,
85.48406906, 40.84118662, -1.00725942, -2.87201027,
-78.0677815 , -18.80891205, -92.39391945, -102.98860959,
-10.59249313, 30.80641523, 87.49634549, -78.3205944 ,
-15.99765437, 33.36382612, -14.10079251, 37.05977621,
-30.02787349, -46.48908886, 40.05820543, 12.34164613,
60.59778037, 32.86144792, -75.09426566, -29.71363055,
-24.45698228, -7.22987835, 35.51963431, 36.92236462,
84.71522683, -30.15837915, 1.30921589, 29.79845728,
7.77733962, 28.66041905, 6.55247136, 45.48181712,
-24.81799125, 12.20440078, -14.91224658, -36.80905193,
51.17004102, -18.4527695 , 12.35095124, -3.73548334,
-9.2651481 , 19.53993158, -0.28221419, 33.07089884,
7.89205558, -2.63194072, 13.32103665, 7.62146851,
-41.3406389 , 13.37658853, -36.09437786, -18.15283789]),
array([ 227.63800054, 89.63235623, -28.94679686, -171.95442583,
-157.36636545, -43.28729374, 97.31828944, -45.66335323,
-100.52371276, 16.04201854, 25.79787405, -43.55558296,
-23.43046377, -53.12619721, -10.16698475, -88.88129402,
77.19121455, 28.42062289, -0.30305782, -56.16625533,
-100.88774848, 38.65317047, 37.17211943, 38.16609239,
-50.05152587, -8.73759989, -49.98339921, -21.65102389,
13.39011805, 48.91359669, -22.98882211, -39.78551088,
-52.06830607, 44.4193014 , -30.76970509, -109.19968443,
-67.17202321, -38.17445022, -66.15981665, -12.53127828,
-29.50283995, -72.71269849, -85.92771623, 62.37326985,
-25.44451665, 30.67529111, 19.77880449, 24.68152321,
-62.80451881, 60.57287154, 22.31731031, 37.22992347,
41.42355257, -50.73447099, -9.21878036, -18.39200695,
-11.15764727, 44.76715383, -16.37372336, -4.55888474,
-4.26690754, 23.23691627, 0.25348381, -37.4707463 ])]
It appears to be a list of NumPy arrays.
How can I create this kind of data while importing as few packages as possible?
Note: every element of the RDD is a 64-dimensional vector, and I plan to create 100 million of them.
Random values are also welcome (for example within [-100, 100]; the exact range doesn't really matter).
Spark provides utilities for generating random RDDs out of the box. In PySpark these live in pyspark.mllib.random.RandomRDDs. For example:
from pyspark.mllib.random import RandomRDDs

# 100 million rows, 64 columns, each value drawn from U(0, 1)
rdd = RandomRDDs.uniformVectorRDD(sc, 100000000, 64)

type(rdd.first())
## numpy.ndarray
rdd.first().shape
## (64,)
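uniformVectorRDD samples each coordinate from U(0, 1), so if you want values roughly in the [-100, 100] range mentioned in the question, a simple elementwise rescale should do; a minimal sketch (the numPartitions and seed values here are arbitrary illustrations, both arguments are optional):
from pyspark.mllib.random import RandomRDDs

# numPartitions and seed are optional; the values below are just examples
rdd = RandomRDDs.uniformVectorRDD(sc, 100000000, 64, numPartitions=1000, seed=42)

# each element is a NumPy array, so this rescales every
# coordinate from [0, 1) to [-100, 100)
scaled = rdd.map(lambda v: 200 * v - 100)

scaled.first().shape
## (64,)
Since the elements are plain NumPy arrays, any elementwise NumPy transformation can be applied the same way inside map.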