Reputation: 41
I'm loading a file from a jdbc, which has a JSON object formated as:
[
{
"numero": 1,
"resposta": "A",
"peso": 2
},
{
"numero": 2,
"resposta": "A",
"peso": 1
},
...
]
Its datatype is set as json (postgresql), but when loading in into spark, it gets loaded with newline and tab characters:
I tried using the following schema, which results in null (I imagine thats because I have to iterate through the list, but I'm not sure how to do that):
schema = StructType(
[
StructField("peso", IntegerType(), False),
StructField("numero", IntegerType(), False),
StructField("resposta", StringType(), False)
]
)
questoes.withColumn("questoes", from_json("questoes", schema)).show(truncate=200)
Output:
Desired DataFrame:
numero | resposta | peso |
---|---|---|
1 | A | 2 |
2 | A | 1 |
...
Code used to read from the DB:
spark = SparkSession.builder.config(
'spark.driver.extraClassPath', 'C:/Users/vitor/AppData/Roaming/DBeaverData/drivers/maven/maven-central/org.postgresql/postgresql-42.2.25.jar').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
url = 'jdbc:postgresql://localhost:5432/informacoes_concursos'
properties = {'user': 'postgres', 'password': '123'}
gabaritos = spark.read.jdbc(url, table="gabaritos", properties=properties)
concursos = spark.read.jdbc(url, table="concursos", properties=properties)
Edit: I fixed the newline and tab characters by changing the dtype from json to jsonb.
Upvotes: 0
Views: 89
Reputation: 10406
So there are two issues with your code:
ArrayType
.\t
and newlines \n
(it is formatted). Spark's from_json
function does not seem to be able to parse that. So we need to clean it.# same as before, but wrapped within an array
schema = ArrayType(
StructType([
StructField("peso", IntegerType(), False),
StructField("numero", IntegerType(), False),
StructField("resposta", StringType(), False)
])
)
result = questoes\
.withColumn("questoes", f.regexp_replace("questoes", "\\s", ""))\
.withColumn("data", f.from_json("questoes", schema))
result.show(truncate=False)
which yields:
+---+---------------------------------------------------------------------------+----------------------+
|id |questoes |data |
+---+---------------------------------------------------------------------------+----------------------+
|1 |[{"numero":1,"resposta":"A","peso":2},{"numero":2,"resposta":"A","peso":1}]|[{2, 1, A}, {1, 2, A}]|
+---+---------------------------------------------------------------------------+----------------------+
and the schema:
result.printSchema()
root
|-- id: long (nullable = true)
|-- questoes: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- peso: integer (nullable = true)
| | |-- numero: integer (nullable = true)
| | |-- resposta: string (nullable = true)
You may drop the questoes
column, I just kept it to display the cleansed json.
Upvotes: 1