Reputation: 11374

How to write out an Array to file in Spark?

Let's say I have a DataFrame df that looks like:

+--------------------+
|            features|
+--------------------+
|[9.409448, 0.0, 0.3]|
|[9.055118, 2.0, 0.3]|
|[9.055118, 2.9, 0.2]|
+--------------------+

It has a single column called "features", which is an array of floats.

How would I write it out to a csv file that looks like this?

9.409448, 0.0, 0.3
9.055118, 2.0, 0.3
9.055118, 2.9, 0.2

What I tried:

Idea: Maybe convert this to a Matrix somehow? I'm not sure how to do that.

Upvotes: 2

Views: 4146

Answers (1)

pault

Reputation: 43494

Assuming that your schema is something like:

df.printSchema()
#root
# |-- features: array (nullable = true)
# |    |-- element: double (containsNull = true)

One idea is to cast your array of floats into an array of strings. Then you can call pyspark.sql.functions.concat_ws to join the elements of the (now string) array into a single string.

For example, using ", " as the separator:

import pyspark.sql.functions as f

df = df.select(
    f.concat_ws(", ", f.col("features").cast("array<string>")).alias("features")
)
df.show(truncate=False)
#+------------------+
#|features          |
#+------------------+
#|9.409448, 0.0, 0.3|
#|9.055118, 2.0, 0.3|
#|9.055118, 2.9, 0.2|
#+------------------+

As you can see from the schema, you now just have a string in the features column:

df.printSchema()
#root
# |-- features: string (nullable = false)

Update

When writing to a csv using pyspark.sql.DataFrameWriter.csv, the default behavior is to quote values if the separator appears as part of the value. To turn off quoting, set the quote option to empty string when you write the file.

Upvotes: 1
