nateirvin

Reputation: 1183

How do I query heterogeneous JSON data in S3?

We have an Amazon S3 bucket containing around a million JSON files, each around 500 KB compressed. The files are written by AWS Kinesis Firehose, with a new one arriving every 5 minutes. They all describe similar events, so they are logically alike and each is valid JSON, but their structures/hierarchies differ. Their formatting and line endings are also inconsistent: some objects sit on a single line, some span many lines, and sometimes the end of one object lands on the same line as the start of the next (i.e., }{).
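
To make the layout concrete, here is a contrived sample (field names invented) and a minimal Python sketch showing that the standard library's json.JSONDecoder.raw_decode can still step through such concatenated objects:

    import json

    # Contrived sample reproducing the layout described above: a pretty-printed
    # object, then "}{" on a shared line, then a single-line object.
    raw = '{\n  "a": 1\n}{"b": {"c": 2}}\n{"a": 3}'

    decoder = json.JSONDecoder()
    idx, objs = 0, []
    while idx < len(raw):
        while idx < len(raw) and raw[idx].isspace():
            idx += 1          # raw_decode() rejects leading whitespace
        if idx == len(raw):
            break
        obj, idx = decoder.raw_decode(raw, idx)
        objs.append(obj)

    print(objs)   # [{'a': 1}, {'b': {'c': 2}}, {'a': 3}]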

We need to parse/query/shred these objects and then import the results into our on-premises SQL Server data warehouse.

Amazon Athena can't deal with the inconsistent spacing/structure. I thought of creating a Lambda function to clean up the spacing, but that still leaves the problem of the differing structures. Since Kinesis forces the files into folders nested by year, month, day, and hour, we would also have to create thousands of partitions every year (one per hour works out to 8,760 a year). Athena's partition limit is not well documented, but research suggests we would quickly exhaust it at that rate.
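
For reference, the cleanup Lambda I had in mind would look roughly like this (a sketch only; the output bucket name is a placeholder and I'm assuming the Firehose files are gzip-compressed):

    import gzip
    import json
    import boto3

    s3 = boto3.client("s3")

    def iter_objects(text):
        """Yield each top-level JSON object, whatever the spacing."""
        decoder = json.JSONDecoder()
        idx = 0
        while idx < len(text):
            while idx < len(text) and text[idx].isspace():
                idx += 1
            if idx == len(text):
                break
            obj, idx = decoder.raw_decode(text, idx)
            yield obj

    def handler(event, context):
        # Invoked by an S3 ObjectCreated event for each new Firehose file
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            text = gzip.decompress(body).decode("utf-8")
            # Rewrite as JSON Lines (one object per line) into a clean bucket
            cleaned = "\n".join(json.dumps(o) for o in iter_objects(text))
            s3.put_object(Bucket="my-clean-bucket", Key=key,
                          Body=cleaned.encode("utf-8"))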

I've looked at pumping the data into Redshift first and then pulling it down. Amazon Redshift external tables can deal with the spacing issues, but can't deal with nested JSON, which almost all these files have. COPY commands can deal with nested JSON, but require us to know the JSON structure beforehand, and don't allow us to access the filename, which we would need for a complete import (it's the only way we can get the date). In general, Redshift has the same problem as Athena: the inconsistent structure makes it difficult to define a schema.
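
To illustrate why the structure must be known up front: a COPY of nested JSON needs a jsonpaths file that enumerates every field to extract. A rough sketch (table name, paths, IAM role, and connection details are all placeholders), run here via psycopg2:

    import psycopg2

    # The jsonpaths file must already live in S3 and must list every field
    # to extract -- which is exactly the "know the structure beforehand"
    # problem. Its content looks like:
    #   {"jsonpaths": ["$.eventId", "$.payload.user.id", "$.payload.amount"]}

    COPY_SQL = """
        COPY events_staging
        FROM 's3://my-bucket/firehose/2018/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        JSON 's3://my-bucket/config/jsonpaths.json'
        GZIP;
    """

    conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",
                            port=5439, dbname="dw", user="loader",
                            password="***")
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)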

I've looked into tools like AWS Glue, but they just move data, and they can't deliver it to our on-premises server, so we would have to find some sort of intermediary, which increases cost, latency, and maintenance overhead.

I've tried cutting out the middleman and using ZappySys' S3 JSON SSIS task to pull the files directly and aggregate them in an SSIS package, but it can't deal with the spacing issues or the inconsistent structure.

I can't be the first person to face this problem, but I just keep spinning my wheels.

Upvotes: 5

Views: 2977

Answers (2)

Ghislain Fourny

Reputation: 7279

Rumble is an open-source (Apache 2.0) engine that allows you to use the JSONiq query language to directly query JSON (specifically, JSON Lines files) stored on S3, without having to move it anywhere else or import it into any data store. Internally, it uses Spark and DataFrames.
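
For instance, a query along these lines (field names invented; json-file() is the Rumble built-in for reading JSON from a path) can be run directly against the S3 data:

    for $event in json-file("s3://my-bucket/firehose/*")
    where $event.type eq "purchase"
    return { "id": $event.eventId, "amount": $event.payload.amount }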

It was successfully tested on collections of more than 20 billion objects (10+ TB), and it also works seamlessly if the data is nested and heterogeneous (missing fields, extra fields, different types in the same field, etc.). It was also tested with Amazon EMR clusters.

Update: Rumble also works with Parquet, CSV, ROOT, AVRO, text, and SVM, and on HDFS, S3, and Azure.

Upvotes: 5

Mukund

Reputation: 946

I would suggest two types of solutions:

  1. I believe MongoDB/DynamoDB/Cassandra are good at handling heterogeneous JSON structures. I am not sure about the inconsistencies in your JSON, but as long as each document is valid JSON, it should be ingestible into one of these databases (please provide a sample JSON if possible). These tools have their own advantages and disadvantages, though, and data modelling for these NoSQL stores is entirely different from traditional SQL; a toy example follows this list.
  2. I am not sure why your Lambda is unable to do the cleanup. I believe you would have tried invoking a Lambda when an S3 PUT happens in the bucket; that should be able to clean up the JSON unless complex processing is involved. See the trigger sketch below the example.
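
On point 1, a toy pymongo illustration (collection and field names invented) of how MongoDB takes differently-shaped documents in one collection without a predefined schema:

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["events"]["raw"]
    coll.insert_many([
        {"eventId": 1, "payload": {"user": {"id": 7}}},     # nested object
        {"eventId": 2, "amount": 9.5, "tags": ["a", "b"]},  # different fields
    ])
    # Dotted paths query into nested structure without any schema definition
    print(coll.count_documents({"payload.user.id": 7}))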
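
On point 2, the S3 PUT trigger is just a bucket notification pointed at the Lambda (a sketch; the bucket name and ARN are placeholders, and the Lambda must separately grant s3.amazonaws.com permission to invoke it):

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket="my-firehose-bucket",
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [{
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:"
                                     "123456789012:function:clean-json",
                "Events": ["s3:ObjectCreated:Put"],
            }]
        },
    )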

Unless the JSON is in a proper format, no tool will be able to process it perfectly. I believe that, more than Athena or Redshift Spectrum, MongoDB/DynamoDB/Cassandra would be the right fit for this use case.

It would be great if you could share the limitations you faced when you created a lot of partitions.

Upvotes: 2
