user1234579

Reputation: 179

pyspark | transforming list of numpy arrays into columns in dataframe

I am trying to take an RDD that looks like:

<code>[<1x24000 sparse matrix of type '' with 10 stored elements in Compressed Sparse Row format>, . . . ]</code>

and ideally turn it into a dataframe that looks like:

<code>
+-----+-----+-----+
|  A  |  B  |  C  |
+-----+-----+-----+
| 1.0 | 0.0 | 0.0 |
+-----+-----+-----+
| 1.0 | 1.0 | 0.0 |
+-----+-----+-----+
</code>

However, I keep getting this:

<code>
+---------------+
|             _1|
+---------------+
|[1.0, 0.0, 0.0]|
+---------------+
|[1.0, 1.0, 0.0]|
+---------------+
</code>

I am having a hard time with this because each row ends up containing a single numpy array rather than separate column values.

I used this code to create the DataFrame from the RDD:

<code>
import numpy as np
from pyspark.sql import Row

res.flatMap(lambda x: np.array(x.todense())) \
   .map(list) \
   .map(lambda l: Row([float(x) for x in l])) \
   .toDF()
</code>

Explode does not help (it puts everything into the same column).

I also tried using a UDF on the resulting dataframe, but I cannot seem to separate the numpy array into individual values.

Please help!

Upvotes: 0

Views: 1752

Answers (1)

user6022341


Try:

<code>.map(lambda l : Row(*[float(x) for x in l]))</code>

Unpacking the list with <code>*</code> passes each value to <code>Row</code> as a separate positional argument, so each value becomes its own column instead of one array-valued column.
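For reference, a minimal runnable sketch of this fix, assuming <code>res</code> is an RDD of one-row scipy CSR matrices as in the question; the three-element data and the column names A, B, C are hypothetical placeholders:

<code>
import numpy as np
from scipy.sparse import csr_matrix
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Stand-in for the RDD in the question: one sparse row vector per element.
res = sc.parallelize([
    csr_matrix(np.array([[1.0, 0.0, 0.0]])),
    csr_matrix(np.array([[1.0, 1.0, 0.0]])),
])

df = (res.flatMap(lambda x: np.array(x.todense()))    # one dense numpy row per matrix
         .map(list)
         .map(lambda l: Row(*[float(x) for x in l]))  # * unpacks into separate fields
         .toDF(["A", "B", "C"]))                      # placeholder column names

df.show()
# +---+---+---+
# |  A|  B|  C|
# +---+---+---+
# |1.0|0.0|0.0|
# |1.0|1.0|0.0|
# +---+---+---+
</code>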

Upvotes: 1
