Reputation: 251
I have recently started working on a new project where we use Spark to write/read data in Parquet format. The project is changing rapidly, and we regularly need to change the schema of the Parquet files. I am currently struggling with versioning both the data and the code.
We use a version control system for our codebase, but it is very hard (at least in my opinion) to do the same for the data itself. I also have a migration script, which I use to migrate data from the old schema to the new schema, but along the way I lose the information about what the schema of a Parquet file was before running the migration. It is my priority to know the original schema as well.
So my questions would be
Upvotes: 4
Views: 1388
Reputation: 82
You can use Delta Lake; it has features for overwriting the schema and maintaining previous versions of the data.
Delta Lake is basically a bunch of Parquet files plus a delta log (commit log).
data.write.format("parquet").mode("overwrite").save("/tmp/delta-table")
The above code snippet overwrites a normal Parquet file, which means the previous data is lost.
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
The above is a Delta Lake overwrite: it checks the delta log and writes the new data into the Delta table as version 1 with a timestamp (if the previous data was version 0), so the previous version is preserved. If the schema of the new data has changed, you also have to allow the schema to be overwritten; see the sketch below. We can also time travel (read previous versions of the data) in Delta Lake, as in the snippet after the sketch.
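A minimal sketch of what that looks like when the schema changes between writes (assuming a SparkSession configured with the delta-spark package; the example DataFrames and the extra created_at column are made up for illustration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-schema-overwrite")
    # standard configs for enabling Delta Lake in a Spark session
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# version 0: original schema (id, value)
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
    .write.format("delta").mode("overwrite").save("/tmp/delta-table")

# version 1: the schema gained a created_at column;
# overwriteSchema tells Delta to replace the table schema instead of failing
spark.createDataFrame([(1, "a", "2020-01-01")], ["id", "value", "created_at"]) \
    .write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/delta-table")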
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
This code can be used to read version 0 of the data.
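Since knowing the original schema is your priority: you can list the versions (with their timestamps) via the table history and then print the schema of an old version after a time-travel read. A sketch, assuming the delta-spark Python package and the same /tmp/delta-table path:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/delta-table")
# one row per version: version number, timestamp, operation, ...
delta_table.history().select("version", "timestamp", "operation").show()

# reading an old version also gives you the schema it was written with
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
old_df.printSchema()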
Upvotes: 2