Vitor L.

Reputation: 41

Convert a column of JSON lists to a DataFrame

I'm loading a table over JDBC; it has a JSON column formatted as:

[
    {
        "numero": 1,
        "resposta": "A",
        "peso": 2
    },
    {
        "numero": 2,
        "resposta": "A",
        "peso": 1
    },
    ...
]

Its data type is set as json (PostgreSQL), but when I load it into Spark, it comes in with newline and tab characters:

[screenshot: the questoes column loaded with embedded newline and tab characters]

I tried using the following schema, which results in null (I imagine that's because I have to iterate through the list, but I'm not sure how to do that):

schema = StructType([
    StructField("peso", IntegerType(), False),
    StructField("numero", IntegerType(), False),
    StructField("resposta", StringType(), False),
])
questoes.withColumn("questoes", from_json("questoes", schema)).show(truncate=200)

Output:

[screenshot: the questoes column shows null after applying the schema]

Desired DataFrame:

numero  resposta  peso
1       A         2
2       A         1

...

Code used to read from the DB:

spark = SparkSession.builder.config(
    'spark.driver.extraClassPath', 'C:/Users/vitor/AppData/Roaming/DBeaverData/drivers/maven/maven-central/org.postgresql/postgresql-42.2.25.jar').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

url = 'jdbc:postgresql://localhost:5432/informacoes_concursos'
properties = {'user': 'postgres', 'password': '123'}

gabaritos = spark.read.jdbc(url, table="gabaritos", properties=properties)
concursos = spark.read.jdbc(url, table="concursos", properties=properties)

Edit: I fixed the newline and tab characters by changing the column type from json to jsonb.
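If the column type cannot be changed, an alternative (a sketch, not tested against the database in the question) is to let PostgreSQL re-serialize the value compactly by casting it in a subquery pushed down through the JDBC `table` option; `url` and `properties` are as defined above:

```python
# Hypothetical workaround: Postgres normalizes formatting when a json value
# is cast through jsonb, so the text Spark receives is compact.
query = "(SELECT id, questoes::jsonb::text AS questoes FROM gabaritos) AS g"
# gabaritos = spark.read.jdbc(url, table=query, properties=properties)
```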

Upvotes: 0

Views: 89

Answers (1)

Oli

Reputation: 10406

So there are two issues with your code:

  • Your JSON is not a struct with 3 fields; it is an array of structs with 3 fields. Therefore you need to change the schema and wrap it in an ArrayType.
  • Inside your database, the JSON data seems to be stored pretty-printed, with tabs (\t) and newlines (\n). Spark's from_json function does not seem to be able to parse that, so we need to clean it first.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

# same schema as before, but wrapped within an array
schema = ArrayType(
    StructType([
        StructField("peso", IntegerType(), False),
        StructField("numero", IntegerType(), False),
        StructField("resposta", StringType(), False),
    ])
)

result = questoes\
    .withColumn("questoes", f.regexp_replace("questoes", "\\s", ""))\
    .withColumn("data", f.from_json("questoes", schema))
result.show(truncate=False)

which yields:

+---+---------------------------------------------------------------------------+----------------------+
|id |questoes                                                                   |data                  |
+---+---------------------------------------------------------------------------+----------------------+
|1  |[{"numero":1,"resposta":"A","peso":2},{"numero":2,"resposta":"A","peso":1}]|[{2, 1, A}, {1, 2, A}]|
+---+---------------------------------------------------------------------------+----------------------+

and the schema:

result.printSchema()
root
 |-- id: long (nullable = true)
 |-- questoes: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- peso: integer (nullable = true)
 |    |    |-- numero: integer (nullable = true)
 |    |    |-- resposta: string (nullable = true)

You may drop the questoes column; I just kept it to display the cleansed JSON.

Upvotes: 1
