python dataframe collect() function

Question

I am facing very weird problem while using collect() function

data = df.select("node_id", "bin", "type", "jsonObj").collect()

JsonObj looks something like this :

[
 {
   "id" : 1,
   "name" : "hello"
 },
 {
   "id" : 2,
   "name" : "world"
 }
]

now in this when I iterate through list made by collect function and print row["jsonObj"] I get my JSON object as part of string rather than just the JSON object. like now I am getting " ' " added to each object in array. The problem with this is when I try to write it in a file it becomes array of string rather than array of json object

['{
   "id" : 1,
   "name" : "hello"
 }',
 '{
   "id" : 2,
   "name" : "world"
 }'
]

Did anyone else face same issue? I just want to store JsonObj as it is to the file rather than as string.

node_id	bin	type	jsonObj
1	a	type1	[ { "id" : 11, "name" : "hello" }, { "id" : 12, "name" : "world" } ]

root
 |-- node_id: long (nullable = true)
 |-- bin: string (nullable = true)
 |-- type: string (nullable = true)
 |-- jsonObj: array (nullable = true)

mck · Accepted Answer

You can convert the JSON string to a struct using from_json:

import pyspark.sql.functions as F
from pyspark.sql.types import *

df2 = df.withColumn(
    "jsonObj",
    F.from_json(
        F.col('jsonObj').cast('string'), 
        ArrayType(StructType([StructField('id', IntegerType()), StructField('name', StringType())]))
    )
)

df2.show(truncate=False)
+-------+---+-----+--------------------------+
|node_id|bin|type |jsonObj                   |
+-------+---+-----+--------------------------+
|1      |a  |type1|[[11, hello], [12, world]]|
+-------+---+-----+--------------------------+

df2.write.json('filepath')

which should give the output as

{"node_id":"1","bin":"a","type":"type1","jsonObj":[{"id":11,"name":"hello"},{"id":12,"name":"world"}]}

python dataframe collect() function

Answers (1)

Related Questions