Samuel Appleton

Reputation: 31

External table in Azure synapse - Error as parquet file is removed

I'm writing a parquet file to an azure data lake storage system, via databricks. I'm using the following to write the parquet file:

cultural_split.write.partitionBy("filename").mode('overwrite').parquet("/mnt/crime/transformed-data")

I'm then creating an external table over the parquet file, and I can query the table using SQL in Synapse. After roughly 30 minutes, I get this error when trying to query the table:

External table 'dbo.cultural_data' is not accessible because location does not exist or it is used by another process.

I then check my data lake and can see that the parquet file is no longer there. Even with the 'show deleted objects' option enabled, I still can't see the file.

Is there something I'm missing when creating the external table? I have soft delete turned off for this blob storage account, and I can't think of any other process that might be deleting this file.

Here's the code I use to create my external table:

IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat') 
    CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] 
    WITH ( FORMAT_TYPE = PARQUET)
GO

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'crime-data_crimedatasam_dfs_core_windows_net') 
    CREATE EXTERNAL DATA SOURCE [crime-data_crimedatasam_dfs_core_windows_net] 
    WITH (
        LOCATION = 'abfss://[email protected]' 
    )
GO

CREATE EXTERNAL TABLE dbo.crime_data (
    [ethnicity] nvarchar(80),
    [police_district] nvarchar(80),
    [lower_age] int,
    [upper_age] int
    )
    WITH (
    LOCATION = 'transformed-data/filename=crime_data/part-00000-tid-3588581976832316109-20e2477d-82c0-47d9-ba75-cf341910855e-20-1.c000.snappy.parquet',
    DATA_SOURCE = [crime-data_crimedatasam_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
    )
GO


SELECT TOP 100 * FROM dbo.crime_data
GO

Many thanks,

Sam

Upvotes: 0

Views: 896

Answers (1)

  • Ensure that no other programs or jobs are using or modifying the location where you are saving the parquet files.
  • If you do not set a Time-to-Live (TTL) on a blob, Azure CDN enforces a default TTL of seven days, unless you have configured caching rules in the Azure portal. Note that this default TTL applies only to general web delivery optimizations.

For more details, see Manage expiration of Azure Blob storage in Azure CDN.

  • When creating the external table, point the LOCATION at a folder (or a wildcard path) rather than a specific part file. Spark writes part files with new, randomly generated names on every overwrite, so an external table pinned to an exact file name breaks as soon as the data is rewritten; a broader path makes the table resilient to changes in files or partitions.
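On the first point, one automated process worth ruling out is a storage lifecycle management policy, which can delete blobs after a set number of days. A quick way to check is a sketch using the Azure CLI; the account name is taken from the question and the resource group is a placeholder:

```shell
# Show any lifecycle management policy configured on the storage account.
# A rule with a "delete" action here could explain blobs disappearing
# without any trace in soft-deleted objects.
az storage account management-policy show \
    --account-name crimedatasam \
    --resource-group <RESOURCE_GROUP>
```

If the command returns a policy with delete rules scoped to your container, that would account for the files vanishing on a schedule.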

I have tried the below approach, pointing to the partition folder and creating an external table:

IF NOT EXISTS (SELECT * FROM sys.external_file_formats WHERE name = 'SynapseParquetFormat') 
    CREATE EXTERNAL FILE FORMAT [SynapseParquetFormat] 
    WITH ( FORMAT_TYPE = PARQUET)
GO

IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'folder02_dileepsynpadls_dfs_core_windows_net') 
    CREATE EXTERNAL DATA SOURCE [folder02_dileepsynpadls_dfs_core_windows_net] 
    WITH (
        LOCATION = 'abfss://<CONTAINER>@<STORAGE ACCOUNT>.dfs.core.windows.net' 
    )
GO
CREATE EXTERNAL TABLE dbo.file_name_ext_table (
    [age] bigint
    )
    WITH (

    LOCATION = 'transformed-data/**',
    DATA_SOURCE = [folder02_dileepsynpadls_dfs_core_windows_net],
    FILE_FORMAT = [SynapseParquetFormat]
    )
GO
SELECT TOP 100 * FROM dbo.file_name_ext_table
GO


Upvotes: 0
