Reputation: 331
I have the following dataframe in PySpark:
+----+-------+-----+
|name|subject|score|
+----+-------+-----+
| Tom| math| 90|
| Tom|physics| 70|
| Amy| math| 95|
+----+-------+-----+
I used the collect_list and struct functions from pyspark.sql.functions:
df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))
to get the following dataframe:
+----+--------------------+
|name| score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy| [[math, 95]]|
+----+--------------------+
My question is: how can I transform the last column, score_list, into a string and dump it into a csv file that looks like
Tom (math, 90) | (physics, 70)
Amy (math, 95)
Any help is appreciated, thanks.
Update: Here is a similar question, but it's not exactly the same, because it goes directly from one string to another string. In my case, I want to first transform a string into a collect_list<struct> and finally stringify that collect_list<struct>.
Upvotes: 4
Views: 7321
Reputation: 13998
Per your Update and comment, for Spark 2.4.0+, here is one way to stringify an array of structs with the Spark SQL builtin functions transform and array_join:
>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- score_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- score: integer (nullable = true)
>>> df.show(2,0)
+----+---------------------------+
|name|score_list |
+----+---------------------------+
|Tom |[[math, 90], [physics, 70]]|
|Amy |[[math, 95]] |
+----+---------------------------+
>>> df.selectExpr(
      "name"
    , """
        array_join(
          transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
        , ' | '
        ) AS score_list
      """
    ).show(2,0)
+----+--------------------------+
|name|score_list |
+----+--------------------------+
|Tom |(math, 90) | (physics, 70)|
|Amy |(math, 95) |
+----+--------------------------+
Where:
In transform(score_list, x -> ...), for each array element (named x), we use concat('(', x.subject, ', ', x.score, ')') to convert it into a string.
In array_join(..., ' | '), we join the resulting array of strings with the delimiter ' | '; this returns the final string.
Upvotes: 8
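If it helps to see the logic outside of Spark, here is a plain-Python sketch of what transform and array_join do to a single row's array (illustrative only; the real work happens inside Spark SQL):

```python
# One row's array of (subject, score) structs, mirroring the example data.
score_list = [("math", 90), ("physics", 70)]

# transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
parts = ["({}, {})".format(subject, score) for subject, score in score_list]

# array_join(parts, ' | ')
result = " | ".join(parts)
print(result)  # (math, 90) | (physics, 70)
```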
Reputation: 43494
The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless you can modify the solutions to fit your desired output quite easily.
Just replace the struct with concat_ws, and also use concat to add the opening and closing parentheses, to get the output you desire.
from pyspark.sql.functions import collect_list, concat, concat_ws, lit
df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ",
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)
#+----+--------------------------+
#|name|score_list |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95) |
#+----+--------------------------+
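The groupBy/collect_list/concat_ws pipeline above can also be modeled in plain Python, which may make the aggregation steps easier to follow (a sketch only; names and data mirror the example dataframe):

```python
# The example rows before aggregation.
rows = [("Tom", "math", 90), ("Tom", "physics", 70), ("Amy", "math", 95)]

# groupBy('name') + collect_list(concat(lit("("), concat_ws(", ", ...), lit(")")))
grouped = {}
for name, subject, score in rows:
    grouped.setdefault(name, []).append("({}, {})".format(subject, score))

# concat_ws(" | ", ...) joins each collected list into one string.
score_list = {name: " | ".join(parts) for name, parts in grouped.items()}
print(score_list["Tom"])  # (math, 90) | (physics, 70)
print(score_list["Amy"])  # (math, 95)
```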
Note that since a comma appears in the score_list column, its values will be quoted when you write to csv with the default arguments.
For example:
df.coalesce(1).write.csv("test.csv")
Would produce the following output file:
Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"
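The quoting Spark applies here follows standard CSV escaping rules; Python's built-in csv module (used as a stand-in for Spark's writer, purely to illustrate the behavior) does the same thing:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default dialect quotes fields containing the delimiter
writer.writerow(["Tom", "(math, 90) | (physics, 70)"])
writer.writerow(["Amy", "(math, 95)"])
print(buf.getvalue())
# Tom,"(math, 90) | (physics, 70)"
# Amy,"(math, 95)"
```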
Upvotes: 4