Reputation: 25376
I have the following RDD, each record is a tuple of (bigint, vector):
myRDD.take(5)
[(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(0, DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0])),
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(1, DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432]))]
How do I expand the DenseVector and make its elements part of the tuple? I.e., I want the above to become:
[(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(0, 5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0),
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(1, 9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432)]
Thanks!
Upvotes: 0
Views: 325
Reputation: 330353
Well, since pyspark.ml.linalg.DenseVector
(or its mllib
counterpart) is iterable (it provides __len__
and __getitem__
methods), you can treat it like any other Python collection, for example:
def as_tuple(kv):
    """
    >>> as_tuple((1, DenseVector([9.25, 1.0, 0.31, 0.31, 162.37])))
    (1, 9.25, 1.0, 0.31, 0.31, 162.37)
    """
    k, v = kv
    # Use *v.toArray() if you want to support SparseVector as well.
    return (k, *v)
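Applied to the RDD from the question this is a one-line map: myRDD.map(as_tuple). A quick sanity check outside Spark (plain lists stand in for DenseVector here, since any iterable unpacks the same way):

```python
def as_tuple(kv):
    k, v = kv
    return (k, *v)

# Plain lists stand in for DenseVector; both support iteration.
records = [
    (1, [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432]),
    (0, [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0]),
]
flattened = [as_tuple(kv) for kv in records]  # with Spark: myRDD.map(as_tuple)
print(flattened[0])
# (1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432)
```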
For Python 2 replace:
(k, *v)
with:
from itertools import chain
tuple(chain([k], v))
or:
(k, ) + tuple(v)
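Both Python 2 variants build the same tuple as the Python 3 unpacking; a quick equivalence check (a list again stands in for the vector):

```python
from itertools import chain

k, v = 1, [9.25, 1.0, 0.31]

# itertools.chain variant vs. tuple concatenation variant
assert tuple(chain([k], v)) == (k,) + tuple(v) == (1, 9.25, 1.0, 0.31)
```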
If you want to convert the values to Python (not NumPy) scalars, use v.toArray().tolist() in place of v.
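For example, with a NumPy array standing in for the vector's backing array (DenseVector.toArray() returns such an array), tolist() turns the numpy.float64 elements into plain Python floats:

```python
import numpy as np

arr = np.array([9.2463, 1.0, 0.392])
values = arr.tolist()

print(type(arr[0]))     # a NumPy scalar type (numpy.float64)
print(type(values[0]))  # plain Python float
```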
Upvotes: 1