Reputation: 885
I have a Delta table with millions of rows and several columns of various types, including nested structs, and I want to create an empty DataFrame clone of the table at runtime - i.e. same schema, no rows.
Can I read the schema without reading any content of the table (so that I can then create an empty DataFrame based on that schema)? I assumed this would be possible, since the Delta transaction logs exist and Delta itself needs quick access to table schemas.
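Creating the empty clone from a schema is the easy part; something like this should do it (a sketch, with df being the loaded table):
# build a zero-row DataFrame that reuses the table's schema
empty_df = spark.createDataFrame([], df.schema)
The expensive part is getting df.schema in the first place.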
What I tried:
- df.schema - accessing the schema immediately after the Delta table load took several minutes as well.
- limit(0) - calling limit(0) immediately after the load still took several minutes.
- limit(0).cache() - limit sometimes gets moved around in the plan, so I also tried adding cache to "fix its position".
Are there any other options? Would it be correct to just access the transaction log JSON and read the schema from the latest transaction?
Context: I want to add a step into our CI that checks the code and various assumptions around schema before it gets actually run with the data.
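Roughly, the CI step would do something like this (a sketch; the path and expected_fields are placeholders):
# hypothetical CI assertion: check that the columns the job relies on exist with the expected types
df = spark.read.format("delta").load("/path/to/table")
expected_fields = {"id": "bigint", "payload": "struct<a:string,b:int>"}
actual_fields = {f.name: f.dataType.simpleString() for f in df.schema.fields}
for name, dtype in expected_fields.items():
    assert actual_fields.get(name) == dtype, f"schema mismatch for column {name}"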
Upvotes: 8
Views: 18632
Reputation: 1
I have found this to work in a similar situation.
df_tab = spark.read.load("dbfs:Path")   # lazy load - no data is read at this point
df_tab.schema                           # the StructType of the loaded table
It returns the schema of the loaded table.
Upvotes: 0
Reputation: 87299
When you access the schema of a Delta table, it doesn't go through all the data, as Delta stores the schema in the transaction log itself, so df.schema
should be enough. But when the transaction log is accessed, it may take some time to reconstruct the actual schema from the JSON/Parquet files that make up the log. Still, several minutes is quite strange, and you need to dig into the execution plan.
I wouldn't recommend reading the transaction log directly, as its format is an internal detail. Besides, the latest transaction may not contain the schema: it's not written into every log file, only when the schema changes.
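If you want to get at the table through a supported API instead, the DeltaTable helper is an option (a sketch; the path is a placeholder, and it assumes the delta-spark package is installed):
from delta.tables import DeltaTable

# go through the public Delta Lake API rather than the raw transaction log;
# toDF() is lazy, so accessing .schema should not scan the data files
dt = DeltaTable.forPath(spark, "/path/to/table")
schema = dt.toDF().schema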
Upvotes: 3