Reputation: 885
I have a Delta table with millions of rows and several columns of various types, including nested structs, and I want to create an empty DataFrame clone of the table at runtime - i.e. same schema, no rows.
Can I read the schema without reading any content of the table (so that I can then create an empty DataFrame based on that schema)? I assumed this would be possible, since the Delta transaction logs exist and Delta itself needs quick access to table schemas.
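Creating the empty clone from a schema is the easy part; something like this should do it (a sketch, with df being the loaded table):
# build a zero-row DataFrame that reuses the table's schema
empty_df = spark.createDataFrame([], df.schema)
The expensive part is getting df.schema in the first place.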
What I tried:
- df.schema - accessing the schema immediately after the Delta table load took several minutes as well.
- limit(0) - calling limit(0) immediately after the load still took several minutes.
- limit(0).cache() - limit sometimes gets moved around in the plan, so I also tried adding cache to "fix its position".
Are there any other options? Would it be correct to just access the transaction log JSON and read the schema from the latest transaction?
Context: I want to add a step into our CI that checks the code and various assumptions around schema before it gets actually run with the data.
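Roughly, the CI step would do something like this (a sketch; the path and expected_fields are placeholders):
# hypothetical CI assertion: check that the columns the job relies on exist with the expected types
df = spark.read.format("delta").load("/path/to/table")
expected_fields = {"id": "bigint", "payload": "struct<a:string,b:int>"}
actual_fields = {f.name: f.dataType.simpleString() for f in df.schema.fields}
for name, dtype in expected_fields.items():
    assert actual_fields.get(name) == dtype, f"schema mismatch for column {name}"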
Upvotes: 8
Views: 18632
Reputation: 1
I have found this to work in a similar situation.
df_tab = spark.read.load("dbfs:Path")   # lazy load - no data is read at this point
df_tab.schema                           # the StructType of the loaded table
It returns the schema of the loaded table.
Upvotes: 0
Reputation: 87299
When you access the schema of a Delta table, it doesn't go through all the data, as Delta stores the schema in the transaction log itself, so df.schema
should be enough. But when the transaction log is accessed, it may take some time to reconstruct the actual schema from the JSON/Parquet files that make up the log. Still, several minutes is quite strange, and you need to dig into the execution plan.
I wouldn't recommend reading the transaction log directly, as its format is an internal detail. Besides, the latest transaction may not contain the schema: it's not written into every log file, only when the schema changes.
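If you want to get at the table through a supported API instead, the DeltaTable helper is an option (a sketch; the path is a placeholder, and it assumes the delta-spark package is installed):
from delta.tables import DeltaTable

# go through the public Delta Lake API rather than the raw transaction log;
# toDF() is lazy, so accessing .schema should not scan the data files
dt = DeltaTable.forPath(spark, "/path/to/table")
schema = dt.toDF().schema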
Upvotes: 3