Reputation: 83
Objective
I have an S3 folder full of JSON files with varying schemas, including arrays (a DynamoDB backup, as it happens). However, while the schemas vary, all files contain some common elements, such as 'id' or 'name', as well as nested arrays of varying lengths, such as 'selected items'. I want to be able to parse out these elements at my leisure.
I have a working approach using an external ETL tool (KNIME), which I'm looking to replicate in a serverless manner via Glue.
Background
The working approach is: load the raw data via Redshift Spectrum as an external table with a single varchar(65535) column, one row per entry, then parse out the required fields using json_extract_path_text and json_array_length.
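To illustrate that parsing pattern (a sketch only - the connection details, table and column names below are placeholders, and it's driven from Python via psycopg2 purely for illustration):

import psycopg2

# Placeholder connection details and object names
conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="me", password="...")

sql = """
    SELECT json_extract_path_text(rawjson, 'id')   AS id,
           json_extract_path_text(rawjson, 'name') AS name,
           json_array_length(json_extract_path_text(rawjson, 'selected items'), true) AS n_items
    FROM   spectrum_schema.raw_json_table;
"""
with conn, conn.cursor() as cur:
    cur.execute(sql)
    rows = cur.fetchall()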
Now, this seems like a task that should be a fit for Glue. Namely, what I want to do is, either:
Load all data using Spectrum, as above
In Glue, create a dynamic frame from the Spectrum table
Parse data from the dynamic frame as above, using functions such as pyspark.sql.functions.explode(), or maybe using Glue's Relationalize transform (a rough sketch of this option follows the list)
Or: crawl the Spectrum table with a Glue Crawler, and work with the resulting catalog table in a Glue job.
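For concreteness, a rough sketch of what I have in mind for the first option (the database, table, path and field names are just placeholders):

from awsglue.context import GlueContext
from awsglue.transforms import Relationalize
from pyspark.context import SparkContext
from pyspark.sql.functions import explode

glueContext = GlueContext(SparkContext.getOrCreate())

# Placeholder database/table names from the Glue catalog
frame = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

# Option A: convert to a Spark DataFrame and explode the nested array
df = frame.toDF()
items = df.select("id", "name", explode("selected items").alias("item"))

# Option B: Relationalize flattens nested structures into a set of flat, linked tables
flattened = Relationalize.apply(
    frame=frame, staging_path="s3://my-bucket/tmp/", name="root")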
Results so far
Unfortunately I haven’t been able to get either of these approaches to work. For the various approaches, the blockers have been:
The ‘single varchar(65535) column’ approach can be achieved by loading the data as an external table in Spectrum; however, it seems that Spectrum tables can’t be loaded into Glue as dynamic frames (note that the relevant table exists in the Glue catalog, where it is shown as having the expected varchar(65535) schema). Working in Zeppelin notebooks, I’ve found that
newFrame = glueContext.create_dynamic_frame.from_catalog(database="<spectrum database>", table_name="<spectrum one column varchar table>")
runs successfully, however it produces a frame with newFrame.count() = 0, and newFrame.toDF().show(n), for any value of n, produces an odd output of the form:
++
||
++
++
In short, it appears that PySpark can’t work directly with Spectrum tables through Glue.
Crawl a Spectrum table using a Crawler. Here I have pointed a crawler at the desired table via a Glue connection to my Redshift cluster. However, this gives an S3 endpoint validation failure, which I have yet to be able to address. I’m reluctant to dig deep into the VPC configuration, given that I’m already pretty uncertain that pointing a Crawler at a Spectrum table is the right approach.
In short, I have found no way to dynamically ingest and parse non-scalar JSON on S3 using either a Glue Crawler or a combination of Glue and Redshift Spectrum. This is arguably not an esoteric task - in fact it is something that would need to be achieved by anyone wanting a relatively automated approach to reporting on data from a DynamoDB-based web app.
Question
So my question, in a single statement: Is there any way to parse non-scalar JSON files on S3, with inconsistent schemas, using Glue (plus, potentially, another AWS service such as RS Spectrum)?
Upvotes: 0
Views: 3033
Reputation: 1929
So there are a few things that I assume are going on in the background.
I assume you defined an external table in Redshift Spectrum that points to S3? If so, that's not the best approach. Instead, define an external schema that points to a database in the Glue Data Catalog. The effect is that RS Spectrum will see everything in that Glue database and you don't need to define separate tables.
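For illustration, the external schema definition might look roughly like this (the schema, database, and IAM role names are placeholders, and it's driven from Python via psycopg2 just as a sketch):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="me", password="...")
conn.autocommit = True  # run the DDL outside an explicit transaction

# Point the external schema at a Glue Data Catalog *database*;
# every table in that database then becomes queryable from Spectrum.
conn.cursor().execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_schema
    FROM DATA CATALOG
    DATABASE 'my_glue_database'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';
""")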
Secondly, have you tried defining a table manually in the Glue Data Catalog? I have done extensive testing with the Parquet file format, messing with table definitions and file contents. The result is that whatever queries that table will only get back the columns defined in the table definition. So you can define a table that has 'id', 'name', and 'selected items' fields, and everything else will be ignored.
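A sketch of doing that programmatically (the database, table, S3 location, column types and SerDe details are assumptions to adapt to your data):

import boto3

glue = boto3.client("glue")

# Placeholder names and location; queries against this table will only return
# the columns listed below, everything else in the files is ignored.
glue.create_table(
    DatabaseName="my_glue_database",
    TableInput={
        "Name": "dynamodb_backup",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "string"},
                {"Name": "name", "Type": "string"},
                # element type is an assumption; the JSON key 'selected items'
                # contains a space, so it is mapped to a Hive-friendly column
                # name via the SerDe parameters below
                {"Name": "selected_items", "Type": "array<string>"},
            ],
            "Location": "s3://my-bucket/dynamodb-backup/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe",
                "Parameters": {"mapping.selected_items": "selected items"},
            },
        },
    },
)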
If for any reason the previous does not seem to work, then a Glue job will help. As a note here - always use plain Spark, never anything glue_context related. Anyway, whichever spark.read method you use (e.g. spark.read.json), you can specify a schema parameter. Use it.
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

dataSchema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    # element type is just an example; match it to your actual 'selected items' data
    StructField("selected items", ArrayType(StringType()), True),
])
That way you'll get the same result - it will only parse out the data that you want.
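As a usage sketch (the S3 path is a placeholder), pass the schema to the reader and only those fields get parsed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# dataSchema as defined above; the path is a placeholder
df = spark.read.json("s3://my-bucket/dynamodb-backup/", schema=dataSchema)
df.select("id", "name", "selected items").show(5)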
The setup in the end should probably be something like this: JSON files on S3, a table defined over them in the Glue Data Catalog containing only the fields you care about, and then either Redshift Spectrum querying that catalog table through an external schema, or a Glue job reading the files with plain Spark and an explicit schema.
Upvotes: 2