Reputation: 127
I want to aggregate some columns and modify it in a particular schema format
id | name_en | name_sp | name_fr |
---|---|---|---|
1 | hello | hello_spanish | hello_french |
I want names in struct format so that when I call json.dumps or to_json it gets converted to appropriate json object.
I am using this ArrayType
schema = ArrayType(StructType([StructField("locale", StringType(), True),StructField("value", StringType(), True)]))
Preferably I would like to convert it to array of struct so that when I call json.dumps on whole table it converts it into following Json object
{
"id" : 1,
"names" : [{"locale" : "en", "value" : "hello"}, {"locale" : "sp", "value" : "hello_spanish"}, {"locale" : "fr", "value" : "hello_french"}]
}
and it does not become jsonObject with string and escape characters.
{
"id" : 1,
"names" : "[{\\"locale\\" : \\"en\\", \\"value\\" : \\"hello\\"}, {\\"locale\\" : \\"sp\\", \\"value\\" : \\"hello_spanish\\"}, {\\"locale\\" : \\"fr\\", \\"value\\" : \\"hello_french\\"}]"
}
Upvotes: 0
Views: 215
Reputation: 42342
You can construct the array of structs as:
df2 = df.selectExpr(
'id',
"""
array(
struct('en' as locale, name_en as value),
struct('sp' as locale, name_sp as value),
struct('fr' as locale, name_fr as value)
) as names
"""
)
df2.show(truncate=False)
+---+------------------------------------------------------+
|id |names |
+---+------------------------------------------------------+
|1 |[[en, hello], [sp, hello_spanish], [fr, hello_french]]|
+---+------------------------------------------------------+
To get as JSON, you can try:
df2.toJSON().collect()
# this gives:
# ['{"id":1,"names":[{"locale":"en","value":"hello"},
# {"locale":"sp","value":"hello_spanish"},
# {"locale":"fr","value":"hello_french"}]}']
Upvotes: 1