Lucas Mignone

Reputation: 65

Transform aws glue get-tables output from json to PySpark Dataframe

I'm trying to transform the json output of aws glue get-tables command into a PySpark dataframe.

After reading the json output with this command:

    df = spark.read.option("inferSchema", "true") \
        .option("multiline", "true") \
        .json("tmp/my_json.json")

I get the following from printSchema:

root
 |-- TableList: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- CatalogId: string (nullable = true)
 |    |    |-- CreateTime: string (nullable = true)
 |    |    |-- CreatedBy: string (nullable = true)
 |    |    |-- DatabaseName: string (nullable = true)
 |    |    |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
 |    |    |-- LastAccessTime: string (nullable = true)
 |    |    |-- Name: string (nullable = true)
 |    |    |-- Owner: string (nullable = true)
 |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
 |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
 |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
 |    |    |    |-- averageRecordSize: string (nullable = true)
 |    |    |    |-- classification: string (nullable = true)
 |    |    |    |-- compressionType: string (nullable = true)
 |    |    |    |-- objectCount: string (nullable = true)
 |    |    |    |-- recordCount: string (nullable = true)
 |    |    |    |-- sizeKey: string (nullable = true)
 |    |    |    |-- spark.sql.create.version: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
 |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
 |    |    |    |-- typeOfData: string (nullable = true)
 |    |    |-- PartitionKeys: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |-- Retention: long (nullable = true)
 |    |    |-- StorageDescriptor: struct (nullable = true)
 |    |    |    |-- BucketColumns: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- Columns: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Name: string (nullable = true)
 |    |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Compressed: boolean (nullable = true)
 |    |    |    |-- InputFormat: string (nullable = true)
 |    |    |    |-- Location: string (nullable = true)
 |    |    |    |-- NumberOfBuckets: long (nullable = true)
 |    |    |    |-- OutputFormat: string (nullable = true)
 |    |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |    |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
 |    |    |    |    |-- CrawlerSchemaSerializerVersion: string (nullable = true)
 |    |    |    |    |-- UPDATED_BY_CRAWLER: string (nullable = true)
 |    |    |    |    |-- averageRecordSize: string (nullable = true)
 |    |    |    |    |-- classification: string (nullable = true)
 |    |    |    |    |-- compressionType: string (nullable = true)
 |    |    |    |    |-- objectCount: string (nullable = true)
 |    |    |    |    |-- recordCount: string (nullable = true)
 |    |    |    |    |-- sizeKey: string (nullable = true)
 |    |    |    |    |-- spark.sql.create.version: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.numParts: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.part.0: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.part.1: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
 |    |    |    |    |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
 |    |    |    |    |-- typeOfData: string (nullable = true)
 |    |    |    |-- SerdeInfo: struct (nullable = true)
 |    |    |    |    |-- Parameters: struct (nullable = true)
 |    |    |    |    |    |-- serialization.format: string (nullable = true)
 |    |    |    |    |-- SerializationLibrary: string (nullable = true)
 |    |    |    |-- SortColumns: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- StoredAsSubDirectories: boolean (nullable = true)
 |    |    |-- TableType: string (nullable = true)
 |    |    |-- UpdateTime: string (nullable = true)

But df ends up with just one column holding the entire JSON:

+--------------------+
|           TableList|
+--------------------+
|[[903342277921, 2...|
+--------------------+

Is there a way to programmatically (and dynamically) create a dataframe whose columns match the nested structure shown by printSchema?

Thanks in advance!

Upvotes: 0

Views: 629

Answers (1)

Robert Kossendey

Reputation: 7018

You can use the explode() function to turn the elements of an array into separate rows:

    from pyspark.sql.functions import explode

    df = df.select(explode(df['TableList'])).select('col.*')
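After the explode, each table in TableList becomes its own row with one top-level column per struct field, and nested fields can be pulled out with dot paths. For example, using column names from the printSchema output above:

    # Pick out a few nested fields by dot path.
    df.select('Name', 'DatabaseName', 'StorageDescriptor.Location').show(truncate=False)

Since the question asks for a dynamic approach, here is one possible sketch of a generic helper (the flatten function and the underscore-joined column names are just an illustration, one way of doing it) that walks df.schema and expands struct columns until none remain:

    from pyspark.sql.types import StructType

    def flatten(df):
        # Illustrative helper: repeatedly expand every top-level struct column
        # into one column per field. Dots in field names (e.g.
        # spark.sql.create.version) are replaced with underscores so the new
        # names are not parsed as nested paths. Array columns are left as-is.
        while any(isinstance(f.dataType, StructType) for f in df.schema.fields):
            cols = []
            for f in df.schema.fields:
                if isinstance(f.dataType, StructType):
                    cols += [df[f.name][sub.name]
                             .alias((f.name + '_' + sub.name).replace('.', '_'))
                             for sub in f.dataType.fields]
                else:
                    cols.append(df[f.name])
            df = df.select(cols)
        return df

    flat = flatten(df)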

Upvotes: 1
