Reputation: 127
I am facing a very weird problem while using the collect() function:
data = df.select("node_id", "bin", "type", "jsonObj").collect()
jsonObj looks something like this:
[
{
"id" : 1,
"name" : "hello"
},
{
"id" : 2,
"name" : "world"
}
]
Now, when I iterate through the list returned by collect() and print row["jsonObj"], I get each JSON object as a string rather than as a JSON object; that is, a " ' " is added around each object in the array. The problem is that when I then write it to a file, it becomes an array of strings rather than an array of JSON objects:
['{
"id" : 1,
"name" : "hello"
}',
'{
"id" : 2,
"name" : "world"
}'
]
Did anyone else face the same issue? I just want to store jsonObj as-is in the file, rather than as a string.
| node_id | bin | type | jsonObj |
|---|---|---|---|
| 1 | a | type1 | [ { "id" : 11, "name" : "hello" }, { "id" : 12, "name" : "world" } ] |
root
|-- node_id: long (nullable = true)
|-- bin: string (nullable = true)
|-- type: string (nullable = true)
|-- jsonObj: array (nullable = true)
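For reference, a driver-side workaround I tried looks roughly like this (collected_rows is a hypothetical stand-in for the result of collect(); in the real job each jsonObj element comes back as a JSON string rather than a parsed object):

```python
import io
import json

# Hypothetical stand-in for df.select(...).collect(): note that each
# element of jsonObj is a JSON *string*, not a parsed object.
collected_rows = [
    {"node_id": 1, "bin": "a", "type": "type1",
     "jsonObj": ['{"id": 11, "name": "hello"}', '{"id": 12, "name": "world"}']}
]

# Parse each string back into a dict before writing, so the output
# holds an array of JSON objects instead of an array of strings.
for row in collected_rows:
    row["jsonObj"] = [json.loads(s) for s in row["jsonObj"]]

# Serialize to an in-memory buffer (a real job would write to a file).
buf = io.StringIO()
json.dump(collected_rows, buf)
```

This works, but it pulls all the parsing onto the driver, which is why I am looking for a way to keep jsonObj as proper JSON in the first place.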
Upvotes: 0
Views: 1579
Reputation: 42352
You can convert the JSON string to an array of structs using from_json:
import pyspark.sql.functions as F
from pyspark.sql.types import *

# Cast the column to string, then parse it into an array of structs
df2 = df.withColumn(
    "jsonObj",
    F.from_json(
        F.col('jsonObj').cast('string'),
        ArrayType(StructType([
            StructField('id', IntegerType()),
            StructField('name', StringType())
        ]))
    )
)
df2.show(truncate=False)
+-------+---+-----+--------------------------+
|node_id|bin|type |jsonObj |
+-------+---+-----+--------------------------+
|1 |a |type1|[[11, hello], [12, world]]|
+-------+---+-----+--------------------------+
df2.write.json('filepath')
which should give output like (node_id is written as a number, per the long type in your schema):
{"node_id":1,"bin":"a","type":"type1","jsonObj":[{"id":11,"name":"hello"},{"id":12,"name":"world"}]}
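As a quick sanity check in plain Python (assuming one line of the written file looks like the record above, with node_id as a number per the long type in the schema), you can confirm that jsonObj is now a genuine nested array of objects rather than an array of strings:

```python
import json

# One line of the file written by df2.write.json (assumed shape).
line = ('{"node_id":1,"bin":"a","type":"type1",'
        '"jsonObj":[{"id":11,"name":"hello"},{"id":12,"name":"world"}]}')

record = json.loads(line)

# jsonObj parses to a real list of dicts, not a list of strings.
names = [obj["name"] for obj in record["jsonObj"]]
```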
Upvotes: 1