Vitor L.

Reputation: 41

Convert a column of JSON lists to a DataFrame

I'm loading a table over JDBC; it has a JSON column formatted as:

[
    {
        "numero": 1,
        "resposta": "A",
        "peso": 2
    },
    {
        "numero": 2,
        "resposta": "A",
        "peso": 1
    },
    ...
]

Its data type is set as json (PostgreSQL), but when I load it into Spark, it comes in with newline and tab characters:

[screenshot: the questoes column loaded with embedded newline and tab characters]

I tried using the following schema, which results in null (I imagine that's because I have to iterate through the list, but I'm not sure how to do that):

schema = StructType([
    StructField("peso", IntegerType(), False),
    StructField("numero", IntegerType(), False),
    StructField("resposta", StringType(), False),
])
questoes.withColumn("questoes", from_json("questoes", schema)).show(truncate=200)

Output:

[screenshot: the questoes column shows null after applying the schema]

Desired DataFrame:

numero  resposta  peso
1       A         2
2       A         1

...

Code used to read from the DB:

spark = SparkSession.builder.config(
    'spark.driver.extraClassPath', 'C:/Users/vitor/AppData/Roaming/DBeaverData/drivers/maven/maven-central/org.postgresql/postgresql-42.2.25.jar').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

url = 'jdbc:postgresql://localhost:5432/informacoes_concursos'
properties = {'user': 'postgres', 'password': '123'}

gabaritos = spark.read.jdbc(url, table="gabaritos", properties=properties)
concursos = spark.read.jdbc(url, table="concursos", properties=properties)

Edit: I fixed the newline and tab characters by changing the column type from json to jsonb.
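If the column type cannot be changed, an alternative (a sketch, not tested against the database in the question) is to let PostgreSQL re-serialize the value compactly by casting it in a subquery pushed down through the JDBC `table` option; `url` and `properties` are as defined above:

```python
# Hypothetical workaround: Postgres normalizes formatting when a json value
# is cast through jsonb, so the text Spark receives is compact.
query = "(SELECT id, questoes::jsonb::text AS questoes FROM gabaritos) AS g"
# gabaritos = spark.read.jdbc(url, table=query, properties=properties)
```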

Upvotes: 0

Views: 89

Answers (1)

Oli

Reputation: 10406

So there are two issues with your code:

  • Your JSON is not a struct with 3 fields; it is an array of structs with 3 fields. Therefore you need to change the schema and wrap it in an ArrayType.
  • Inside your database, the JSON data seems to be stored pretty-printed, with tabs (\t) and newlines (\n). Spark's from_json function does not seem to be able to parse that, so we need to clean it first.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType, StringType

# same schema as before, but wrapped within an array
schema = ArrayType(
    StructType([
        StructField("peso", IntegerType(), False),
        StructField("numero", IntegerType(), False),
        StructField("resposta", StringType(), False),
    ])
)

result = questoes\
    .withColumn("questoes", f.regexp_replace("questoes", "\\s", ""))\
    .withColumn("data", f.from_json("questoes", schema))
result.show(truncate=False)

which yields:

+---+---------------------------------------------------------------------------+----------------------+
|id |questoes                                                                   |data                  |
+---+---------------------------------------------------------------------------+----------------------+
|1  |[{"numero":1,"resposta":"A","peso":2},{"numero":2,"resposta":"A","peso":1}]|[{2, 1, A}, {1, 2, A}]|
+---+---------------------------------------------------------------------------+----------------------+

and the schema:

result.printSchema()
root
 |-- id: long (nullable = true)
 |-- questoes: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- peso: integer (nullable = true)
 |    |    |-- numero: integer (nullable = true)
 |    |    |-- resposta: string (nullable = true)

You may drop the questoes column; I just kept it to display the cleansed JSON.

Upvotes: 1
