AnttiM

Reputation: 11

PySpark missing column name error with numerical column names in an Azure Synapse Analytics notebook

I am facing this issue when creating a DataFrame with an Azure Synapse SQL dedicated pool as the data source. Some of the columns have numerical column names such as "240". I read the table with the synapsesql connector in Scala, register it as a temporary view, and then grab it into a PySpark DataFrame using spark.sql. Even though I can print the schema of the DataFrame without any problems, trying to select any of the columns with numerical names produces an error.

The error seems to be about empty aliases that correspond to column names with special characters. I have not been able to figure out whether this is a Spark issue or whether it has something to do with Synapse Analytics as the data source.

%%spark
// Scala cell: read the table from the dedicated SQL pool and register it as a temp view
val df = spark.read.
  option(Constants.SERVER, "db.sql.azuresynapse.net").
  synapsesql("DWH.table")

df.createOrReplaceTempView("table")

# PySpark cell: pick the temp view up through spark.sql
df_p = spark.sql("SELECT * FROM table")
df_p.select('240').show()
df_p.printSchema()

I understand that I should use backticks when working with column names that contain illegal characters, but the following produces the same error:

df_p = spark.sql("SELECT * FROM table")
df_p.select('`240`').show()
df_p.printSchema()

The error produced:

Py4JJavaError: An error occurred while calling o204.showString.
: com.microsoft.spark.sqlanalytics.exception.SQLAnalyticsConnectorException: com.microsoft.sqlserver.jdbc.SQLServerException: An object or column name is missing or empty. For SELECT INTO statements, verify each column has a name. For other statements, look for empty alias names. Aliases defined as "" or [] are not allowed. Change the alias to a valid name.

Could someone let me know why I end up with this error?

Thank you!

Upvotes: 1

Views: 923

Answers (1)

IpsitaDash-MT

Reputation: 1450

For the Read API, token-based authentication to a dedicated SQL pool outside of the workspace is currently not supported by the connector, so SQL authentication is required:

val df = spark.read.
  option(Constants.SERVER, "samplews.database.windows.net").
  option(Constants.USER, "<SQLServer Login UserName>").
  option(Constants.PASSWORD, "<SQLServer Login Password>").
  synapsesql("<DBName>.<Schema>.<TableName>")

Create a temp view from the DataFrame in PySpark, and run a Scala cell in the PySpark notebook using magics:

pyspark_df.createOrReplaceTempView("pysparkdftemptable")  # PySpark cell

val scala_df = spark.sqlContext.sql("select * from pysparkdftemptable")  // Scala cell (%%spark)

Then you can select the column using select() and show():

df_p.select("240").show()
df_p.printSchema()
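If the numeric name itself is what trips up the connector's query pushdown, one workaround worth trying is to rename the column on the Spark side right after pulling the view into PySpark, so nothing downstream has to reference "240" directly. This is only a sketch, untested against a dedicated pool, and the new name c240 is just an example:

# Workaround sketch (assumption, not verified against a dedicated pool):
# rename the numeric column to a conventional name, then select it as usual.
df_p = spark.sql("SELECT * FROM table")
df_renamed = df_p.withColumnRenamed("240", "c240")  # "c240" is an arbitrary example name
df_renamed.select("c240").show()
df_renamed.printSchema()

Whether this avoids the empty-alias error depends on how the connector translates the renamed projection back into T-SQL, so treat it as something to test rather than a guaranteed fix.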

If you are still facing the error, I would request you to restart the session and run it again.

As shown in the screenshot from my notebook, this works:

[screenshot: the select on the numeric column succeeding in a Synapse notebook]

Upvotes: 0
