reallyJim

How should I reference a Delta table with PySpark?

I'm having difficulty referencing a Delta table to perform an upsert/merge on it after creating it. Writing it via PySpark with the typical dataframe.write.format("delta") approach works fine.
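To make that concrete, here is a minimal sketch of the write-based pattern that works for me (df, the sample data, and the path are placeholders for my actual values, and the Spark session is assumed to be configured for Delta Lake):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder frame standing in for my real data
df = spark.createDataFrame([(1, "x")], ["id", "value"])

# Written this way, the table can be referenced afterwards without problems
df.write.format("delta").mode("overwrite").save("/path/to/table")

The trouble starts when I instead create the table manually with the Delta table builder API create syntax: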

from delta.tables import DeltaTable

deltaTable = (
    DeltaTable.createIfNotExists(spark)
    .location("/path/to/table")
    .tableName("table")
    .addColumn("id", dataType="String")
    ...
    .execute()
)

I can see that the folder exists in storage as expected, and I can verify that it's a Delta table using DeltaTable.isDeltaTable(spark, tablePath).
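Concretely (tablePath is just a variable holding the abfss location quoted at the end of this question):

DeltaTable.isDeltaTable(spark, tablePath)  # returns True for this table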

The problem occurs when I run someTable = DeltaTable.forPath(spark, tablePath); it fails with the error

pyspark.sql.utils.AnalysisException: A partition path fragment should be the form like 'part1=foo/part2=bar'

Whether or not I explicitly partition the table in the create statement doesn't seem to matter. I am trying to read the whole table, not a single partition.

So the question is: how do I reference the table correctly so I can load and manage it?
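For context, the upsert I ultimately want to run against the table looks roughly like this; updatesDF, the aliases, and the join column are placeholders rather than my real schema:

someTable = DeltaTable.forPath(spark, tablePath)  # the call that currently fails

(someTable.alias("target")
    .merge(updatesDF.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())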

I'm using Azure Data Lake Storage Gen2 for storage, though I'm not sure that's part of the issue.

In case it's relevant, the full path I use for location is abfss://container_name@storage_account_name.dfs.core.windows.net/blobContainerName/delta/tables/nws, where nws has business meaning.
