unknown

Reputation: 251

Parquet schema management

I have recently started working on a new project where we use Spark to write/read data in Parquet format. The project is changing rapidly, and we regularly need to change the schema of the Parquet files. I am currently struggling with versioning the data and the code.

We use a version control system for our codebase, but it is very hard (at least in my opinion) to do the same for the data itself. I also have a migration script, which I use to migrate data from the old schema to the new schema, but along the way I lose the information about what the schema of a Parquet file was before running the migration. Knowing the original schema is a priority for me.

So my questions would be

Upvotes: 4

Views: 1388

Answers (1)

Shantanu singh

Reputation: 82

You can use Delta Lake; it has features for overwriting the schema and maintaining previous versions of the data.

A Delta Lake table is basically a set of Parquet files plus a delta log (a commit log).

data.write.format("parquet").mode("overwrite").save("/tmp/delta-table")

The above snippet writes a plain Parquet file in overwrite mode, which means the previous data is simply overwritten and lost.

data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

The above is a Delta Lake overwrite: it checks the delta log and writes the new data as version 1 with a timestamp (if the previous data was version 0). Delta Lake also supports time travel, i.e. reading previous versions of the data.
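Since the question is about schema changes, here is a minimal sketch of the schema-overwrite feature mentioned above; it uses the Delta Lake overwriteSchema write option, and the table path is just the example path from the snippets above:

# Overwrite both the data and the table schema in one commit; the old schema
# stays readable through the previous version in the delta log.
data.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/delta-table")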

df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")

This code can be used to read version 0 of the data.
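As a rough sketch (assuming the /tmp/delta-table path from the examples above), you can also time travel by timestamp and inspect the commit history, which lets you recover the schema a table had before a migration:

from delta.tables import DeltaTable

# Read the table as of a point in time (the timestamp below is a placeholder)
old_df = spark.read.format("delta") \
    .option("timestampAsOf", "2020-01-01 00:00:00") \
    .load("/tmp/delta-table")

# The schema of that version is simply the schema of the returned DataFrame
old_df.printSchema()

# List all versions with their timestamps and operations
DeltaTable.forPath(spark, "/tmp/delta-table").history().show(truncate=False)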

Upvotes: 2
