user554481

Reputation: 2075

Assign SQL schema to Spark DataFrame

I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField, and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.

This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?

create_table_sql = '''
CREATE TABLE public.example (
  id LONG,
  example VARCHAR(80)
)'''
spark.sql(create_table_sql)
schema = spark.table("public.example").schema  # pull the schema from the empty table
s3_data = spark.read \
    .option("delimiter", "|") \
    .csv("s3a://" + s3_bucket_path, schema=schema)
s3_data.write.insertInto("public.example")  # the table already exists, so insert into it

Upvotes: 1

Views: 2620

Answers (2)

abiratsis

Reputation: 7336

Yes, there is a way to create a schema from a string, although I'm not sure it really looks like SQL! You can use:

from pyspark.sql.types import _parse_datatype_string

_parse_datatype_string("id: long, example: string")

This will create the following schema:

  StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
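
The parsed schema plugs straight into the reader. A minimal sketch, reusing the delimiter and s3_bucket_path from the question:

from pyspark.sql.types import _parse_datatype_string

schema = _parse_datatype_string("id: long, example: string")
s3_data = spark.read \
    .option("delimiter", "|") \
    .csv("s3a://" + s3_bucket_path, schema=schema)

Note that since Spark 2.3 the reader's schema parameter also accepts a DDL-formatted string directly, e.g. spark.read.csv(path, schema="id LONG, example STRING"), so in many cases you don't need to parse the schema yourself at all.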

You can also parse a more complex schema:

schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")

StructType(
  List(StructField(
    customers,ArrayType(
      StructType(
        List(
          StructField(id,LongType,true),
          StructField(name,StringType,true),
          StructField(address,StringType,true)
        )
      ),true),true)
  )
)
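
To sanity-check a nested schema like this, here is a small sketch; the sample rows are made up for illustration:

from pyspark.sql.types import _parse_datatype_string

schema = _parse_datatype_string(
    "customers array<struct<id: long, name: string, address: string>>")

# One hypothetical row whose single column is an array of customer structs
data = [([(1, "Alice", "1 Main St"), (2, "Bob", "2 Oak Ave")],)]
df = spark.createDataFrame(data, schema)
df.selectExpr("inline(customers)").show()  # explode the array of structs into columns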

You can check for more examples here.

UPDATE: 23 March 2023

In the case of a JSON string, use _parse_datatype_json_string, as shown next:

from pyspark.sql.types import _parse_datatype_json_string

schema_json = """{"fields": [
        {
            "metadata": {},
            "name": "id",
            "nullable": true,
            "type": "long"
        },
        {
            "metadata": {},
            "name": "example",
            "nullable": true,
            "type": "string"
        }],
    "type": "struct"}"""
_parse_datatype_json_string(schema_json)

# StructType(List(StructField(id,LongType,true),StructField(example,StringType,true)))
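
This JSON layout is exactly what StructType.json() emits, so you can round-trip a schema through a string, for example to store it alongside the data. A minimal sketch:

from pyspark.sql.types import _parse_datatype_json_string

# createDataFrame also accepts a DDL-formatted schema string
df = spark.createDataFrame([(1, "a")], "id long, example string")

schema_json = df.schema.json()                       # serialize the schema to JSON
restored = _parse_datatype_json_string(schema_json)  # parse it back
assert restored == df.schema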

Upvotes: 4

Max

Reputation: 36

Adding to what has already been said: building a schema (e.g. StructType-based or from JSON) is more straightforward in Scala Spark than in PySpark:

> import org.apache.spark.sql.types.StructType

> val s = StructType.fromDDL("customers array<struct<id: long, name: string, address: string>>")

> s

res3: org.apache.spark.sql.types.StructType = StructType(StructField(customers,ArrayType(StructType(StructField(id,LongType,true),StructField(name,StringType,true),StructField(address,StringType,true)),true),true))

> s.prettyJson
res9: String =
{
  "type" : "struct",
  "fields" : [ {
    "name" : "customers",
    "type" : {
      "type" : "array",
      "elementType" : {
        "type" : "struct",
        "fields" : [ {
          "name" : "id",
          "type" : "long",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "name",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        }, {
          "name" : "address",
          "type" : "string",
          "nullable" : true,
          "metadata" : { }
        } ]
      },
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}

Upvotes: 0
