Reputation: 11374

How to write out an Array to file in Spark?

Let's say I have a DataFrame df that looks like:

+--------------------+
|            features|
+--------------------+
|[9.409448, 0.0, 0.3]|
|[9.055118, 2.0, 0.3]|
|[9.055118, 2.9, 0.2]|
+--------------------+

It has a single column called "features", which is an array of floats.

How would I write it out to a csv file that looks like this?

9.409448, 0.0, 0.3
9.055118, 2.0, 0.3
9.055118, 2.9, 0.2

What I tried:

Idea: Maybe convert this to a Matrix somehow? I'm not sure how to do that.

Upvotes: 2

Views: 4146

Answers (1)

pault

Reputation: 43494

Assuming that your schema is something like:

df.printSchema()
#root
# |-- features: array (nullable = true)
# |    |-- element: double (containsNull = true)

One idea is to cast your array of floats into an array of strings. Then you can call pyspark.sql.functions.concat_ws to join the elements of the (now string) array into a single string.

For example, using ", " as the separator:

import pyspark.sql.functions as f

df = df.select(
    f.concat_ws(", ", f.col("features").cast("array<string>")).alias("features")
)
df.show(truncate=False)
#+------------------+
#|features          |
#+------------------+
#|9.409448, 0.0, 0.3|
#|9.055118, 2.0, 0.3|
#|9.055118, 2.9, 0.2|
#+------------------+

As you can see from the schema, you now just have a string in the features column:

df.printSchema()
#root
# |-- features: string (nullable = false)

Update

When writing to a csv using pyspark.sql.DataFrameWriter.csv, the default behavior is to quote values if the separator appears as part of the value. To turn off quoting, set the quote option to empty string when you write the file.

Upvotes: 1
