AnttiM

Reputation: 11

PySpark missing column name error with numerical column names in an Azure Synapse Analytics notebook

I am facing this issue when creating a DataFrame with an Azure Synapse SQL dedicated pool as the data source. Some of the columns have numerical column names such as "240". I read the table with the synapsesql connector in Scala, register it as a temporary view, and then grab it into a PySpark DataFrame using spark.sql. Even though I can print the schema of the DataFrame without any problems, trying to select any of the columns with numerical names produces an error.

The error seems to be about empty aliases that correspond to column names with special characters. I have not been able to figure out whether this is a Spark issue or whether it has something to do with Synapse Analytics as the data source.

%%spark
// Scala cell: read the table from the dedicated SQL pool and register it as a temp view
val df = spark.read.
  option(Constants.SERVER, "db.sql.azuresynapse.net").
  synapsesql("DWH.table")

df.createOrReplaceTempView("table")

# PySpark cell: pick the temp view up through spark.sql
df_p = spark.sql("SELECT * FROM table")
df_p.select('240').show()
df_p.printSchema()

I understand that I should use backticks when working with column names that contain illegal characters, but the following produces the same error:

df_p = spark.sql("SELECT * FROM table")
df_p.select('`240`').show()
df_p.printSchema()

The error produced:

Py4JJavaError: An error occurred while calling o204.showString.
: com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: An object or column name is missing or empty. For SELECT INTO statements, verify each column has a name. For other statements, look for empty alias names. Aliases defined as "" or [] are not allowed. Change the alias to a valid name.

Could someone let me know why I end up with this error?

Thank you!

Upvotes: 1

Views: 923

Answers (1)

IpsitaDash-MT

Reputation: 1450

For the Read API, token-based authentication to a dedicated SQL pool outside of the workspace is currently not supported by the connector, so SQL authentication is required:

val df = spark.read.
  option(Constants.SERVER, "samplews.database.windows.net").
  option(Constants.USER, "<SQLServer Login UserName>").
  option(Constants.PASSWORD, "<SQLServer Login Password>").
  synapsesql("<DBName>.<Schema>.<TableName>")

Create a temp view from the DataFrame in PySpark, and run a Scala cell in the PySpark notebook using magics:

pyspark_df.createOrReplaceTempView("pysparkdftemptable")  # PySpark cell

val scala_df = spark.sqlContext.sql("select * from pysparkdftemptable")  // Scala cell (%%spark)

Then you can select the column using select() and show():

df_p.select("240").show()
df_p.printSchema()
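If the numeric name itself is what trips up the connector's query pushdown, one workaround worth trying is to rename the column on the Spark side right after pulling the view into PySpark, so nothing downstream has to reference "240" directly. This is only a sketch, untested against a dedicated pool, and the new name c240 is just an example:

# Workaround sketch (assumption, not verified against a dedicated pool):
# rename the numeric column to a conventional name, then select it as usual.
df_p = spark.sql("SELECT * FROM table")
df_renamed = df_p.withColumnRenamed("240", "c240")  # "c240" is an arbitrary example name
df_renamed.select("c240").show()
df_renamed.printSchema()

Whether this avoids the empty-alias error depends on how the connector translates the renamed projection back into T-SQL, so treat it as something to test rather than a guaranteed fix.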

If you are still facing the error, I would request you to restart the session and run it again.

As shown in the screenshot from my notebook, this works:

[screenshot: the select on the numeric column succeeding in a Synapse notebook]

Upvotes: 0
