Marek

Reputation: 885

How to get schema of Delta table without reading content?

I have a Delta table with millions of rows and several columns of various types, including nested structs, and I want to create an empty DataFrame clone of it at runtime, i.e. with the same schema but no rows.

Can I read the schema without reading any of the table's content, so that I can then create an empty DataFrame based on it? I assumed this would be possible, since Delta keeps transaction logs and needs quick access to table schemas itself.

What I tried: loading the table with spark.read and calling .schema on the resulting DataFrame, but that takes several minutes.

Are there any other options? Would it be correct to just access the transaction log JSON and read the schema from the latest transaction?

Context: I want to add a step to our CI that checks the code and various assumptions about the schema before it actually runs against the data.
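For illustration, this is roughly the kind of CI check I have in mind, assuming a hypothetical expected_schema.json committed to the repo (produced earlier via schema.json()) and a hypothetical table path:

import json
from pyspark.sql.types import StructType

# hypothetical file holding the schema the code expects
with open("expected_schema.json") as f:
    expected = StructType.fromJson(json.load(f))

# ideally this should not require scanning the table's rows
actual = spark.read.format("delta").load("dbfs:/mnt/tables/events").schema

assert actual == expected, "Delta table schema differs from what the code assumes"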

Upvotes: 8

Views: 18632

Answers (2)

Erich Huckschlag

Reputation: 1

I have found this to work in a similar situation.

# specifying the format avoids falling back to the default (plain parquet) reader
df_tab = spark.read.format("delta").load("dbfs:Path")

df_tab.schema

It returns the schema of the loaded table; since the load is lazy, no table data is actually scanned.

Upvotes: 0

Alex Ott

Reputation: 87299

When you access the schema of a Delta table, it doesn't go through all the data, because Delta stores the schema in the transaction log itself, so df.schema should be enough. But when the transaction log is accessed, it may take some time to reconstruct the actual schema from the JSON/Parquet files that make up the log. Still, several minutes is quite strange, and you would need to dig into the execution plan.
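A minimal sketch of that approach, with a hypothetical table path (the load is lazy, so only the transaction log is touched to resolve the schema):

# hypothetical path; no Spark action is triggered here
df = spark.read.format("delta").load("dbfs:/mnt/tables/events")

# schema is resolved from the transaction log, not from the data files
schema = df.schema

# empty DataFrame clone: same schema, zero rows
empty_df = spark.createDataFrame([], schema)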

I wouldn't recommend reading the transaction log directly, as its format is an internal implementation detail. Besides, the latest transaction may not contain the schema at all: it isn't written into every log file, only when the schema changes.
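If you want a supported interface instead of raw log files, one option is the DeltaTable API from the delta-spark package; a sketch, again with a hypothetical path:

from delta.tables import DeltaTable

# resolve the table through the public API; the schema still comes from the log
dt = DeltaTable.forPath(spark, "dbfs:/mnt/tables/events")
print(dt.toDF().schema.simpleString())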

Upvotes: 3
