Reputation: 25376
I have the following RDD, each record is a tuple of (bigint, vector):
myRDD.take(5)
[(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(0, DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0])),
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(1, DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432]))]
How do I expand the DenseVector and make its elements part of the tuple? I.e., I want the above to become:
[(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(0, 5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0),
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(1, 9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432)]
Thanks!
Upvotes: 0
Views: 325
Reputation: 330353
Well, since pyspark.ml.linalg.DenseVector
(or its mllib
counterpart) is iterable (it provides __len__
and __getitem__
methods), you can treat it like any other Python collection, for example:
def as_tuple(kv):
    """
    >>> as_tuple((1, DenseVector([9.25, 1.0, 0.31, 0.31, 162.37])))
    (1, 9.25, 1.0, 0.31, 0.31, 162.37)
    """
    k, v = kv
    # Use *v.toArray() if you want to support SparseVector as well.
    return (k, *v)
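Applied to the RDD from the question this is a one-line map: myRDD.map(as_tuple). A quick sanity check outside Spark (plain lists stand in for DenseVector here, since any iterable unpacks the same way):

```python
def as_tuple(kv):
    k, v = kv
    return (k, *v)

# Plain lists stand in for DenseVector; both support iteration.
records = [
    (1, [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432]),
    (0, [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0]),
]
flattened = [as_tuple(kv) for kv in records]  # with Spark: myRDD.map(as_tuple)
print(flattened[0])
# (1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432)
```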
For Python 2 replace:
(k, *v)
with:
from itertools import chain
tuple(chain([k], v))
or:
(k, ) + tuple(v)
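Both Python 2 variants build the same tuple as the Python 3 unpacking; a quick equivalence check (a list again stands in for the vector):

```python
from itertools import chain

k, v = 1, [9.25, 1.0, 0.31]

# itertools.chain variant vs. tuple concatenation variant
assert tuple(chain([k], v)) == (k,) + tuple(v) == (1, 9.25, 1.0, 0.31)
```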
If you want to convert the values to Python (not NumPy) scalars, use v.toArray().tolist() in place of v.
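For example, with a NumPy array standing in for the vector's backing array (DenseVector.toArray() returns such an array), tolist() turns the numpy.float64 elements into plain Python floats:

```python
import numpy as np

arr = np.array([9.2463, 1.0, 0.392])
values = arr.tolist()

print(type(arr[0]))     # a NumPy scalar type (numpy.float64)
print(type(values[0]))  # plain Python float
```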
Upvotes: 1