Bagel912

Reputation: 331

PySpark: converting an array of structs into a string

I have the following DataFrame in PySpark:

+----+-------+-----+                                                            
|name|subject|score|
+----+-------+-----+
| Tom|   math|   90|
| Tom|physics|   70|
| Amy|   math|   95|
+----+-------+-----+
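
For reproducibility, here is a minimal sketch that builds this sample DataFrame (an assumption on my part, presuming an active SparkSession bound to the name spark):

# hypothetical setup, not part of the original question
df = spark.createDataFrame(
    [("Tom", "math", 90), ("Tom", "physics", 70), ("Amy", "math", 95)],
    ["name", "subject", "score"]
)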

I used the collect_list and struct functions from pyspark.sql.functions

from pyspark.sql.functions import collect_list, struct

df.groupBy('name').agg(collect_list(struct('subject', 'score')).alias('score_list'))

to get the following dataframe

+----+--------------------+
|name|          score_list|
+----+--------------------+
| Tom|[[math, 90], [phy...|
| Amy|        [[math, 95]]|
+----+--------------------+

My question is: how can I transform the last column, score_list, into a string and dump it into a csv file that looks like

Tom     (math, 90) | (physics, 70)
Amy     (math, 95)

I'd appreciate any help, thanks.

Update: Here is a similar question, but it's not exactly the same because it goes directly from one string to another string. In my case, I want to first transform the string columns into a collect_list<struct> and only then stringify that collect_list<struct>.

Upvotes: 4

Views: 7321

Answers (2)

jxc
jxc

Reputation: 13998

Per your update and comment, for Spark 2.4.0+, here is one way to stringify an array of structs with the Spark SQL built-in functions transform and array_join:

>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- score_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- subject: string (nullable = true)
 |    |    |-- score: integer (nullable = true)

>>> df.show(2,0)
+----+---------------------------+
|name|score_list                 |
+----+---------------------------+
|Tom |[[math, 90], [physics, 70]]|
|Amy |[[math, 95]]               |
+----+---------------------------+

>>> df.selectExpr(
        "name"
      , """
         array_join(
             transform(score_list, x -> concat('(', x.subject, ', ', x.score, ')'))
           , ' | '
         ) AS score_list
        """
).show(2,0)

+----+--------------------------+
|name|score_list                |
+----+--------------------------+
|Tom |(math, 90) | (physics, 70)|
|Amy |(math, 95)                |
+----+--------------------------+

Where:

  1. Use transform() to convert the array of structs into an array of strings: for each array element (the struct x), concat('(', x.subject, ', ', x.score, ')') turns it into a string.
  2. Use array_join() to join all of the (now StringType) array elements with ' | ', which returns the final string. An equivalent DataFrame-API version is sketched below.
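
If you prefer to stay in the Python DataFrame API, here is a rough equivalent sketch (my addition, not from the answer above; it needs Spark 3.1+, where pyspark.sql.functions.transform is available):

from pyspark.sql import functions as F

# transform() maps each struct x to a '(subject, score)' string,
# then array_join() glues the strings together with ' | '
df.select(
    'name',
    F.array_join(
        F.transform(
            'score_list',
            lambda x: F.concat(F.lit('('), x['subject'], F.lit(', '),
                               x['score'].cast('string'), F.lit(')'))
        ),
        ' | '
    ).alias('score_list')
).show(2, 0)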

Upvotes: 8

pault

Reputation: 43494

The duplicates I linked don't exactly answer your question, since you're combining multiple columns. Nevertheless, you can easily modify those solutions to fit your desired output.

Just replace the struct with concat_ws. Then use concat to add the opening and closing parentheses and get the output you desire.

from pyspark.sql.functions import collect_list, concat, concat_ws, lit

df = df.groupBy('name')\
    .agg(
        concat_ws(
            " | ", 
            collect_list(
                concat(lit("("), concat_ws(", ", 'subject', 'score'), lit(")"))
            )
        ).alias('score_list')
    )
df.show(truncate=False)

#+----+--------------------------+
#|name|score_list                |
#+----+--------------------------+
#|Tom |(math, 90) | (physics, 70)|
#|Amy |(math, 95)                |
#+----+--------------------------+

Note that since a comma appears in the score_list column, this value will be quoted when you write to csv with the default arguments.

For example:

df.coalesce(1).write.csv("test.csv")

This would produce the following output file:

Tom,"(math, 90) | (physics, 70)"
Amy,"(math, 95)"

Upvotes: 4
