J_gsaw23

Reputation: 3

Unable to find files in DBFS after creating a parquet file in Databricks

I created a parquet file using the following command:

output_path = "/mnt/dev/lvl1/lvl2/"

# Write the DataFrame as parquet, overwriting the contents of output_path
df\
.write\
.mode("overwrite")\
.format("parquet")\
.save(output_path)

This code was executed successfully. However, the path "/mnt/dev/lvl1/lvl2/" also contained the subfolders lvl2_1 and lvl2_2, which held delta tables. When I now try to access those tables, I get the error "AnalysisException: Delta table /mnt/dev/lvl1/lvl2/lvl2_1 doesn't exist." Checking the files with dbutils.fs.ls("/mnt/dev/lvl1/lvl2/lvl2_1") shows three items: _SUCCESS, _delta_log, and committed. I have verified the path and have accessed this table many times before, but now I cannot read the delta table because of the above error. The only action performed on this folder was the creation of the parquet file; my expectation was that the parquet file would be saved to output_path without deleting or corrupting anything else. I am not sure what caused this, and I would appreciate any help identifying the issue so that it is not repeated. Please let me know if any other information is required.
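For reference, the kind of check I ran on the affected folder looks roughly like this (the DeltaTable.isDeltaTable call is only an illustrative check added here, not part of my original code):

# Minimal sketch of the checks on the affected folder (illustrative only)
from delta.tables import DeltaTable

table_path = "/mnt/dev/lvl1/lvl2/lvl2_1"

# List what is physically left in the folder (_SUCCESS, _delta_log, committed)
for f in dbutils.fs.ls(table_path):
    print(f.path)

# Ask Spark whether it still recognises the path as a Delta table
print(DeltaTable.isDeltaTable(spark, table_path))

# This is the read that now fails with the AnalysisException
df_delta = spark.read.format("delta").load(table_path)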

Upvotes: 0

Views: 576

Answers (1)

JayashankarGS

Reputation: 8140

That happens because the files inside lvl2_1 were deleted, and the deletion was triggered by writing the parquet files in overwrite mode. Even though the _delta_log is still present, reading the table now gives you an error.

Whenever you write data as parquet, you either need to specify a path that is an empty directory or use overwrite mode.

In your case, because you overwrote the parent directory, all the underlying parquet files in it were deleted. If you open the committed file in the delta table folder, it lists the files that were removed.
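To check this yourself, a rough sketch like the one below prints the contents of that committed file; the exact file name and its JSON layout are Databricks commit-protocol internals, so this simply dumps whatever is in it:

# Print the commit metadata file that the parquet overwrite left behind
# (the "committed" file name and its JSON layout are Databricks-internal)
table_path = "/mnt/dev/lvl1/lvl2/lvl2_1"

for f in dbutils.fs.ls(table_path):
    if "committed" in f.name:
        print(dbutils.fs.head(f.path))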


My suggestion is to keep parquet and delta data in different directories, for example a dedicated subdirectory such as lvl2_par for the parquet output, instead of writing everything into one shared parent directory.

output_path = "/mnt/dev/lvl1/lvl2/lvl2_par/"

# Write the parquet output into its own subdirectory instead of the shared parent
df\
.write\
.mode("overwrite")\
.format("parquet")\
.save(output_path)

This will not give you an error or create conflicts between parquet and delta, as long as you do not overwrite a parent directory that also holds other tables.
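For illustration, once the two formats live in separate sibling folders, each stays readable independently; a minimal sketch (assuming the delta table in lvl2_1 is intact) would be:

# Parquet output and delta tables kept in separate sibling folders
parquet_path = "/mnt/dev/lvl1/lvl2/lvl2_par/"   # parquet output only
delta_path = "/mnt/dev/lvl1/lvl2/lvl2_1"        # existing delta table

# Reading each format from its own folder, with no conflict between them
df_parquet = spark.read.parquet(parquet_path)
df_delta = spark.read.format("delta").load(delta_path)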

Upvotes: 0
