Reputation: 43
I want to process some parquet files (with snappy compression) using AutoLoader in Databricks. A lot of those files are empty or contain just one record. Also, I cannot change how they are created, nor compact them.
Here are some of the approaches I tried so far:
I have set the following AutoLoader configurations:
I use the following readStream configurations:
df = (
    spark.readStream.format("cloudFiles")
    .options(**CLOUDFILE_CONFIG)
    .option("cloudFiles.format", "parquet")
    .option("pathGlobFilter", "*.snappy")
    .option("recursiveFileLookup", True)
    .schema(schema)
    .option("locale", "de-DE")
    .option("dateFormat", "dd.MM.yyyy")
    .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
    .load(<path-to-source>)
)
And the following writeStream configurations:
(
    df.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", <path_to_checkpoint>)
    .queryName(<processed_table_name>)
    .partitionBy(<partition-key>)
    .option("mergeSchema", True)
    .trigger(once=True)
    .start(<path-to-target>)
)
My preferred solution would be to use DBX, but I don't understand why the job succeeds yet I only see empty folders in the target location. This is very strange behavior; I suspect AutoLoader is timing out after reading only empty files for some time!
P.S. The same thing also happens when I use plain Parquet Spark streaming instead of AutoLoader.
Do you know of any reason why this is happening and how I can overcome this issue?
Upvotes: 0
Views: 1161
Reputation: 6588
If you are reading files written as Parquet with snappy compression, the file extension is '.snappy.parquet'. You can try changing the pathGlobFilter to match *.snappy.parquet. My second doubt is regarding "cloudFiles.allowOverwrites": True; you may give it a try without this option.
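A minimal sketch of what that could look like, assuming the rest of your CLOUDFILE_CONFIG stays the same and no longer sets cloudFiles.allowOverwrites (adjust the source path and config to your setup):

# Sketch: match the full ".snappy.parquet" extension and test without
# "cloudFiles.allowOverwrites" in CLOUDFILE_CONFIG.
df = (
    spark.readStream.format("cloudFiles")
    .options(**CLOUDFILE_CONFIG)                   # assumed to no longer set cloudFiles.allowOverwrites
    .option("cloudFiles.format", "parquet")
    .option("pathGlobFilter", "*.snappy.parquet")  # was "*.snappy"
    .option("recursiveFileLookup", True)
    .schema(schema)
    .load(<path-to-source>)
)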
Upvotes: 0
Reputation: 33
Are you specifying the schema of the streaming read? (Sorry, can't add comments yet)
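For reference, an explicit schema on the streaming read would look roughly like this (the column names and types below are made up for illustration; replace them with the actual columns of your Parquet files):

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical schema; an explicit schema avoids inference over empty/tiny files.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("event_time", TimestampType(), True),
])

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .schema(schema)
    .load(<path-to-source>)
)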
Upvotes: 0