Let's say I have a DataFrame df
that looks like:
+--------------------+
| features|
+--------------------+
|[9.409448, 0.0, 0.3]|
|[9.055118, 2.0, 0.3]|
|[9.055118, 2.9, 0.2]|
+--------------------+
It has one column, "features", which is an array of floats.
How would I write it out to a csv file that looks like this?
9.409448, 0.0, 0.3
9.055118, 2.0, 0.3
9.055118, 2.9, 0.2
What I tried:
Idea: Maybe convert this to a Matrix somehow? I'm not sure how to do that.
Upvotes: 2
Views: 4146
Assuming that your schema is something like:
df.printSchema()
#root
# |-- features: array (nullable = true)
# | |-- element: double (containsNull = true)
One idea is to cast your array of floats into an array of strings. Then you can call pyspark.sql.functions.concat_ws
to join the elements of the (now string) array into a single string.
For example, using ", "
as the separator:
import pyspark.sql.functions as f
df = df.select(
f.concat_ws(", ", f.col("features").cast("array<string>")).alias("features")
)
df.show(truncate=False)
#+------------------+
#|features |
#+------------------+
#|9.409448, 0.0, 0.3|
#|9.055118, 2.0, 0.3|
#|9.055118, 2.9, 0.2|
#+------------------+
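For intuition, here is what that cast + concat_ws step does, sketched in plain Python (no Spark) using the three rows from the question:

```python
# Plain-Python sketch of cast("array<string>") + concat_ws(", ", ...),
# using the rows from the question -- no Spark required:
rows = [
    [9.409448, 0.0, 0.3],
    [9.055118, 2.0, 0.3],
    [9.055118, 2.9, 0.2],
]

# The cast turns each float into its string form; concat_ws then
# joins the strings with the separator:
joined = [", ".join(str(x) for x in row) for row in rows]
for line in joined:
    print(line)
# 9.409448, 0.0, 0.3
# 9.055118, 2.0, 0.3
# 9.055118, 2.9, 0.2
```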
As you can see from the schema, you now just have a string in the features
column:
df.printSchema()
#root
# |-- features: string (nullable = false)
Update
When writing to a csv using pyspark.sql.DataFrameWriter.csv
, the default behavior is to quote any value that contains the separator. To turn off quoting, set the quote
option to an empty string when you write the file, e.g. df.write.csv(path, quote="").
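The default quoting rule is the same minimal-quoting behavior that Python's stdlib csv module applies, so it can be demonstrated without Spark (a sketch, not Spark's actual writer): a field containing the delimiter gets wrapped in quotes.

```python
import csv
import io

# The joined string contains the delimiter (","), so a standard CSV
# writer quotes the whole field by default:
value = "9.409448, 0.0, 0.3"

buf = io.StringIO()
csv.writer(buf).writerow([value])
print(buf.getvalue())  # "9.409448, 0.0, 0.3"  (wrapped in quotes)
```

Turning off quoting in Spark (quote="") writes the string verbatim instead, giving the unquoted lines shown in the question.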
Upvotes: 1