nateirvin

Reputation: 1183

How do I query heterogeneous JSON data in S3?

We have an Amazon S3 bucket containing around a million JSON files, each around 500 KB compressed. The files are written by AWS Kinesis Firehose, with a new one arriving every 5 minutes. They all describe similar events, so they are logically alike and each is valid JSON, but their structures/hierarchies differ. Their formatting and line endings are also inconsistent: some objects sit on a single line, some span many lines, and sometimes the end of one object lands on the same line as the start of the next (i.e., }{).
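
To make the layout concrete, here is a contrived sample (field names invented) and a minimal Python sketch showing that the standard library's json.JSONDecoder.raw_decode can still step through such concatenated objects:

    import json

    # Contrived sample reproducing the layout described above: a pretty-printed
    # object, then "}{" on a shared line, then a single-line object.
    raw = '{\n  "a": 1\n}{"b": {"c": 2}}\n{"a": 3}'

    decoder = json.JSONDecoder()
    idx, objs = 0, []
    while idx < len(raw):
        while idx < len(raw) and raw[idx].isspace():
            idx += 1          # raw_decode() rejects leading whitespace
        if idx == len(raw):
            break
        obj, idx = decoder.raw_decode(raw, idx)
        objs.append(obj)

    print(objs)   # [{'a': 1}, {'b': {'c': 2}}, {'a': 3}]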

We need to parse/query/shred these objects and then import the results into our on-premises SQL Server data warehouse.

Amazon Athena can't deal with the inconsistent spacing/structure. I thought of creating a Lambda function to clean up the spacing, but that still leaves the problem of the differing structures. Since Kinesis forces the files into folders nested by year, month, day, and hour, we would also have to create thousands of partitions every year (one per hour works out to 8,760 a year). Athena's partition limit is not well documented, but research suggests we would quickly exhaust it at that rate.
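
For reference, the cleanup Lambda I had in mind would look roughly like this (a sketch only; the output bucket name is a placeholder and I'm assuming the Firehose files are gzip-compressed):

    import gzip
    import json
    import boto3

    s3 = boto3.client("s3")

    def iter_objects(text):
        """Yield each top-level JSON object, whatever the spacing."""
        decoder = json.JSONDecoder()
        idx = 0
        while idx < len(text):
            while idx < len(text) and text[idx].isspace():
                idx += 1
            if idx == len(text):
                break
            obj, idx = decoder.raw_decode(text, idx)
            yield obj

    def handler(event, context):
        # Invoked by an S3 ObjectCreated event for each new Firehose file
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            text = gzip.decompress(body).decode("utf-8")
            # Rewrite as JSON Lines (one object per line) into a clean bucket
            cleaned = "\n".join(json.dumps(o) for o in iter_objects(text))
            s3.put_object(Bucket="my-clean-bucket", Key=key,
                          Body=cleaned.encode("utf-8"))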

I've looked at pumping the data into Redshift first and then pulling it down. Amazon Redshift external tables can deal with the spacing issues, but can't deal with nested JSON, which almost all these files have. COPY commands can deal with nested JSON, but require us to know the JSON structure beforehand, and don't allow us to access the filename, which we would need for a complete import (it's the only way we can get the date). In general, Redshift has the same problem as Athena: the inconsistent structure makes it difficult to define a schema.
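
To illustrate why the structure must be known up front: a COPY of nested JSON needs a jsonpaths file that enumerates every field to extract. A rough sketch (table name, paths, IAM role, and connection details are all placeholders), run here via psycopg2:

    import psycopg2

    # The jsonpaths file must already live in S3 and must list every field
    # to extract -- which is exactly the "know the structure beforehand"
    # problem. Its content looks like:
    #   {"jsonpaths": ["$.eventId", "$.payload.user.id", "$.payload.amount"]}

    COPY_SQL = """
        COPY events_staging
        FROM 's3://my-bucket/firehose/2018/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        JSON 's3://my-bucket/config/jsonpaths.json'
        GZIP;
    """

    conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com",
                            port=5439, dbname="dw", user="loader",
                            password="***")
    with conn, conn.cursor() as cur:
        cur.execute(COPY_SQL)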

I've looked into tools like AWS Glue, but they just move data, and they can't deliver it to our on-premises server, so we would have to find some sort of intermediary, which increases cost, latency, and maintenance overhead.

I've tried cutting out the middleman and using ZappySys' S3 JSON SSIS task to pull the files directly and aggregate them in an SSIS package, but it can't deal with the spacing issues or the inconsistent structure.

I can't be the first person to face this problem, but I just keep spinning my wheels.

Upvotes: 5

Views: 2977

Answers (2)

Ghislain Fourny

Reputation: 7279

Rumble is an open-source (Apache 2.0) engine that allows you to use the JSONiq query language to directly query JSON (specifically, JSON Lines files) stored on S3, without having to move it anywhere else or import it into any data store. Internally, it uses Spark and DataFrames.
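
For instance, a query along these lines (field names invented; json-file() is the Rumble built-in for reading JSON from a path) can be run directly against the S3 data:

    for $event in json-file("s3://my-bucket/firehose/*")
    where $event.type eq "purchase"
    return { "id": $event.eventId, "amount": $event.payload.amount }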

It was successfully tested on collections of more than 20 billion objects (10+ TB), and it also works seamlessly if the data is nested and heterogeneous (missing fields, extra fields, different types in the same field, etc.). It was also tested with Amazon EMR clusters.

Update: Rumble also works with Parquet, CSV, ROOT, AVRO, text, and SVM, and on HDFS, S3, and Azure.

Upvotes: 5

Mukund

Reputation: 946

I would suggest two types of solutions:

  1. I believe MongoDB/DynamoDB/Cassandra are good at handling heterogeneous JSON structures. I am not sure about the inconsistencies in your JSON, but as long as each document is valid JSON, it should be ingestible into one of these databases (please provide a sample JSON if possible). These tools have their own advantages and disadvantages, though, and data modelling for these NoSQL stores is entirely different from traditional SQL; a toy example follows this list.
  2. I am not sure why your Lambda is unable to do the cleanup. I believe you would have tried invoking a Lambda when an S3 PUT happens in the bucket; that should be able to clean up the JSON unless complex processing is involved. See the trigger sketch below the example.
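
On point 1, a toy pymongo illustration (collection and field names invented) of how MongoDB takes differently-shaped documents in one collection without a predefined schema:

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["events"]["raw"]
    coll.insert_many([
        {"eventId": 1, "payload": {"user": {"id": 7}}},     # nested object
        {"eventId": 2, "amount": 9.5, "tags": ["a", "b"]},  # different fields
    ])
    # Dotted paths query into nested structure without any schema definition
    print(coll.count_documents({"payload.user.id": 7}))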
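
On point 2, the S3 PUT trigger is just a bucket notification pointed at the Lambda (a sketch; the bucket name and ARN are placeholders, and the Lambda must separately grant s3.amazonaws.com permission to invoke it):

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket="my-firehose-bucket",
        NotificationConfiguration={
            "LambdaFunctionConfigurations": [{
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:"
                                     "123456789012:function:clean-json",
                "Events": ["s3:ObjectCreated:Put"],
            }]
        },
    )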

Unless the JSON is in a proper format, no tool will be able to process it perfectly. I believe that, more than Athena or Redshift Spectrum, MongoDB/DynamoDB/Cassandra would be the right fit for this use case.

It would be great if you could share the limitations you faced when you created a lot of partitions.

Upvotes: 2
