Reputation: 199
I am trying to process a JSON column from a PostgreSQL database. I am able to connect to the database using:
import os
import pyspark
import findspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
findspark.init(os.environ['SPARK_HOME'])
# DB credentials
user = os.environ['EVENTS_DEV_UID']
password = os.environ['EVENTS_DEV_PWD']
host = os.environ['EVENTS_DEV_HOST']
port = os.environ['EVENTS_DEV_PORT']
db = os.environ['EVENTS_DEV_DBNAME']
# Initiate spark session
sc = SparkContext()
spark = SQLContext(sc)
# Set properties
properties = {"user": user, "password": password, "driver": "org.postgresql.Driver"}
# Load data
df = spark.read.jdbc(
    url='jdbc:postgresql://' + host + ':' + port + '/' + db,
    table='events',
    properties=properties)
The problem starts when casting the JSON field: Spark doesn't recognize the struct format of params. When I print the schema:
df.printSchema()
root
|-- time: timestamp (nullable = true)
|-- name: string (nullable = true)
|-- params: string (nullable = true)
When I try to cast the string to a struct:
df = df.withColumn('params', df.params.cast('struct'))
I get the following error:
ParseException: '\nDataType struct is not supported.(line 1, pos 0)\n\n== SQL ==\nstruct\n^^^\n'
I guess the problem is the escape characters. Does anybody have an idea how to proceed?
Upvotes: 1
Views: 1873
Reputation: 10096
"struct"
is not a valid casting type. You can define your own UDF
using python's json.loads
function.
Let's start with a sample data frame:
df = sc.parallelize([[1, "a", "{\"a\":1, \"b\":2}"], [2, "b", "{\"a\":3, \"b\":4}"]])\
.toDF(["col1", "col2", "json_col"])
df.show()
+----+----+--------------+
|col1|col2| json_col|
+----+----+--------------+
| 1| a|{"a":1, "b":2}|
| 2| b|{"a":3, "b":4}|
+----+----+--------------+
The output StructType would then have the following schema:
from pyspark.sql.types import IntegerType, StructField, StructType
schema = StructType([StructField("a", IntegerType()), StructField("b", IntegerType())])
You cannot cast StringType to StructType, hence the UDF:
import pyspark.sql.functions as psf
import json
# UDF that parses a JSON string into the struct defined by `schema`
json_load = psf.udf(json.loads, schema)
Now we can process json_col:
df_parsed = df.withColumn("parsed_json", json_load("json_col"))
df_parsed.show()
df_parsed.printSchema()
+----+----+--------------+-----------+
|col1|col2| json_col|parsed_json|
+----+----+--------------+-----------+
| 1| a|{"a":1, "b":2}| [1,2]|
| 2| b|{"a":3, "b":4}| [3,4]|
+----+----+--------------+-----------+
root
|-- col1: long (nullable = true)
|-- col2: string (nullable = true)
|-- json_col: string (nullable = true)
|-- parsed_json: struct (nullable = true)
| |-- a: integer (nullable = true)
| |-- b: integer (nullable = true)
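One caveat worth noting (my addition, not part of the original answer): json.loads raises an exception on NULL or malformed values, which would fail the whole task. If your params column can contain such rows, you may want to guard the UDF, for example:
# Sketch: guard json.loads so NULL or malformed rows become null instead of failing the job
def safe_json_load(s):
    if s is None:
        return None
    try:
        return json.loads(s)
    except ValueError:  # includes json.JSONDecodeError on Python 3
        return None

json_load = psf.udf(safe_json_load, schema)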
You can also try passing the schema directly when loading the data frame.
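If you are on Spark 2.1 or later, the built-in from_json function can also replace the UDF entirely; this is just a sketch of that alternative, reusing the df and schema defined above:
# Alternative sketch (Spark 2.1+): parse the JSON column without a Python UDF
df_parsed = df.withColumn("parsed_json", psf.from_json("json_col", schema))
df_parsed.select("col1", "parsed_json.a", "parsed_json.b").show()
Since from_json stays in the JVM, it avoids Python serialization overhead, and by default it returns null for malformed records instead of raising an error.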
Upvotes: 2