A007

Reputation: 127

Aggregate columns and change to an ArrayType schema in PySpark

I want to aggregate some columns and reshape them into a particular schema format, starting from a table like this:

id  name_en  name_sp        name_fr
1   hello    hello_spanish  hello_french

I want the names in struct format so that when I call json.dumps or to_json they get converted to the appropriate JSON object.

I am using this ArrayType schema:

schema = ArrayType(StructType([
    StructField("locale", StringType(), True),
    StructField("value", StringType(), True),
]))

Preferably I would like to convert it to an array of structs, so that when I call json.dumps on the whole table it converts into the following JSON object:

{
 "id" : 1,
  "names" : [{"locale" : "en", "value" : "hello"}, {"locale" : "sp", "value" : "hello_spanish"}, {"locale" : "fr", "value" : "hello_french"}]
} 

and not a JSON object where names is a string with escape characters:

{
 "id" : 1,
  "names" : "[{\"locale\" : \"en\", \"value\" : \"hello\"}, {\"locale\" : \"sp\", \"value\" : \"hello_spanish\"}, {\"locale\" : \"fr\", \"value\" : \"hello_french\"}]"
}

Upvotes: 0

Views: 215

Answers (1)

mck

Reputation: 42342

You can construct the array of structs as:

df2 = df.selectExpr(
    'id', 
    """
    array(
        struct('en' as locale, name_en as value),
        struct('sp' as locale, name_sp as value),
        struct('fr' as locale, name_fr as value)
    ) as names
    """
)

df2.show(truncate=False)
+---+------------------------------------------------------+
|id |names                                                 |
+---+------------------------------------------------------+
|1  |[[en, hello], [sp, hello_spanish], [fr, hello_french]]|
+---+------------------------------------------------------+

To get it as JSON, you can try:

df2.toJSON().collect()

# this gives:
# ['{"id":1,"names":[{"locale":"en","value":"hello"},
#   {"locale":"sp","value":"hello_spanish"},
#   {"locale":"fr","value":"hello_french"}]}']

Upvotes: 1

Related Questions