Reputation: 251
I have recently started working on a new project where we use Spark to write/read data in Parquet format. The project is changing rapidly, and we regularly need to change the schema of the Parquet files. I am currently struggling with versioning both the data and the code.
We use a version control system for our codebase, but it is very hard (at least in my opinion) to do the same for the data itself. I also have a migration script, which I use to migrate data from the old schema to the new schema, but along the way I lose the information about what the schema of a Parquet file was before running the migration. It is my priority to know the original schema as well.
So my questions would be
Upvotes: 4
Views: 1388
Reputation: 82
You can use Delta Lake; it has features for overwriting the schema and maintaining previous versions of the data.
Delta Lake is basically a bunch of Parquet files plus a delta log (commit log).
data.write.format("parquet").mode("overwrite").save("/tmp/delta-table")
The above code snippet overwrites a normal Parquet file, which means the previous data is lost.
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
The above is a Delta Lake overwrite: it checks the delta log and writes the new data into the Delta table as version 1 with a timestamp (if the previous data was version 0), so the previous version is preserved. If the schema of the new data has changed, you also have to allow the schema to be overwritten; see the sketch below. We can also time travel (read previous versions of the data) in Delta Lake, as in the snippet after the sketch.
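A minimal sketch of what that looks like when the schema changes between writes (assuming a SparkSession configured with the delta-spark package; the example DataFrames and the extra created_at column are made up for illustration):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-schema-overwrite")
    # standard configs for enabling Delta Lake in a Spark session
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# version 0: original schema (id, value)
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
    .write.format("delta").mode("overwrite").save("/tmp/delta-table")

# version 1: the schema gained a created_at column;
# overwriteSchema tells Delta to replace the table schema instead of failing
spark.createDataFrame([(1, "a", "2020-01-01")], ["id", "value", "created_at"]) \
    .write.format("delta").mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/delta-table")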
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
This code can be used to read version 0 of the data.
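Since knowing the original schema is your priority: you can list the versions (with their timestamps) via the table history and then print the schema of an old version after a time-travel read. A sketch, assuming the delta-spark Python package and the same /tmp/delta-table path:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/tmp/delta-table")
# one row per version: version number, timestamp, operation, ...
delta_table.history().select("version", "timestamp", "operation").show()

# reading an old version also gives you the schema it was written with
old_df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
old_df.printSchema()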
Upvotes: 2