Reputation: 65
I'm trying to transform the JSON output of the aws glue get-tables command into a PySpark DataFrame.
After reading the json output with this command:
df = spark.read.option("inferSchema", "true") \
.option("multiline", "true") \
.json("tmp/my_json.json")
I get the following from printSchema:
root
|-- TableList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- CatalogId: string (nullable = true)
| | |-- CreateTime: string (nullable = true)
| | |-- CreatedBy: string (nullable = true)
| | |-- DatabaseName: string (nullable = true)
| | |-- IsRegisteredWithLakeFormation: boolean (nullable = true)
| | |-- LastAccessTime: string (nullable = true)
| | |-- Name: string (nullable = true)
| | |-- Owner: string (nullable = true)
| | |-- Parameters: struct (nullable = true)
| | | |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
| | | |-- CrawlerSchemaSerializerVersion: string (nullable = true)
| | | |-- UPDATED_BY_CRAWLER: string (nullable = true)
| | | |-- averageRecordSize: string (nullable = true)
| | | |-- classification: string (nullable = true)
| | | |-- compressionType: string (nullable = true)
| | | |-- objectCount: string (nullable = true)
| | | |-- recordCount: string (nullable = true)
| | | |-- sizeKey: string (nullable = true)
| | | |-- spark.sql.create.version: string (nullable = true)
| | | |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
| | | |-- spark.sql.sources.schema.numParts: string (nullable = true)
| | | |-- spark.sql.sources.schema.part.0: string (nullable = true)
| | | |-- spark.sql.sources.schema.part.1: string (nullable = true)
| | | |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
| | | |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
| | | |-- typeOfData: string (nullable = true)
| | |-- PartitionKeys: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Name: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | |-- Retention: long (nullable = true)
| | |-- StorageDescriptor: struct (nullable = true)
| | | |-- BucketColumns: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- Columns: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Name: string (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | |-- Compressed: boolean (nullable = true)
| | | |-- InputFormat: string (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- NumberOfBuckets: long (nullable = true)
| | | |-- OutputFormat: string (nullable = true)
| | | |-- Parameters: struct (nullable = true)
| | | | |-- CrawlerSchemaDeserializerVersion: string (nullable = true)
| | | | |-- CrawlerSchemaSerializerVersion: string (nullable = true)
| | | | |-- UPDATED_BY_CRAWLER: string (nullable = true)
| | | | |-- averageRecordSize: string (nullable = true)
| | | | |-- classification: string (nullable = true)
| | | | |-- compressionType: string (nullable = true)
| | | | |-- objectCount: string (nullable = true)
| | | | |-- recordCount: string (nullable = true)
| | | | |-- sizeKey: string (nullable = true)
| | | | |-- spark.sql.create.version: string (nullable = true)
| | | | |-- spark.sql.sources.schema.numPartCols: string (nullable = true)
| | | | |-- spark.sql.sources.schema.numParts: string (nullable = true)
| | | | |-- spark.sql.sources.schema.part.0: string (nullable = true)
| | | | |-- spark.sql.sources.schema.part.1: string (nullable = true)
| | | | |-- spark.sql.sources.schema.partCol.0: string (nullable = true)
| | | | |-- spark.sql.sources.schema.partCol.1: string (nullable = true)
| | | | |-- typeOfData: string (nullable = true)
| | | |-- SerdeInfo: struct (nullable = true)
| | | | |-- Parameters: struct (nullable = true)
| | | | | |-- serialization.format: string (nullable = true)
| | | | |-- SerializationLibrary: string (nullable = true)
| | | |-- SortColumns: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- StoredAsSubDirectories: boolean (nullable = true)
| | |-- TableType: string (nullable = true)
| | |-- UpdateTime: string (nullable = true)
But df ends up with just one column containing the whole JSON:
+--------------------+
| TableList|
+--------------------+
|[[903342277921, 2...|
+--------------------+
Is there a way to programmatically (and dynamically) create the dataframe in the same way that is referenced in printSchema?
Thanks in advance!
Upvotes: 0
Views: 629
Reputation: 7018
You can use the explode() function to turn the elements of an array into separate rows, then select the struct's fields:
from pyspark.sql.functions import explode

df = df.select(explode(df['TableList']).alias('col')).select('col.*')
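To get the rest of the way to the "programmatically (and dynamically)" part, you can walk the resulting schema and promote nested struct fields (Parameters, StorageDescriptor, SerdeInfo, ...) to top-level columns. A minimal sketch, starting from the exploded df above; flatten_structs is a hypothetical helper, not part of PySpark, and it leaves array columns such as PartitionKeys and StorageDescriptor.Columns as they are:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_structs(df):
    # Repeat until no struct-typed columns remain at the top level;
    # each pass flattens one level of nesting.
    while any(isinstance(f.dataType, StructType) for f in df.schema.fields):
        new_cols = []
        for f in df.schema.fields:
            if isinstance(f.dataType, StructType):
                # Backticks guard against field names containing dots,
                # e.g. Parameters.spark.sql.create.version.
                new_cols += [col(f"`{f.name}`.`{c.name}`").alias(f"{f.name}_{c.name}")
                             for c in f.dataType.fields]
            else:
                new_cols.append(col(f"`{f.name}`"))
        df = df.select(new_cols)
    return df

flat = flatten_structs(df)
flat.printSchema()

This gives one row per table with columns like DatabaseName, Name, Parameters_classification, StorageDescriptor_Location, etc., so you can inspect or filter the Glue catalog entries directly.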
Upvotes: 1