patryk_ostrowski

Reputation: 237

Spark JDBC: Incorrect syntax in spark.read

I am trying to read the maximum id value in my table by using

  _select_sql = f"(SELECT MAX(id) FROM {tablename})"
  highest_id = spark.read.jdbc(url, table=_select_sql, properties=properties)

After executing this I am getting:

com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WHERE'

When I try to read all data using highest_id = spark.read.jdbc(url, table=tablename, properties=properties), everything is fine.

Do you know where the mistake could be?

Edit:

After changing to

_select_sql = f"(SELECT MAX(id) FROM {tablename}"

I am getting:

com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near '0'.

Upvotes: 3

Views: 4921

Answers (4)

dsrjr

Reputation: 26

The second part of the issue (after the edit) comes from not closing the parenthesis at the end of your _select_sql query:

_select_sql = f"(SELECT MAX(id) FROM {tablename})"

The first part (original post) is that some JDBC drivers/source databases require that a name (alias) be added after the parenthesized query in order for it to work. I know this is the case for SQL Server, which your query is executing against, but Oracle, for example, does not require it. E.g. for SQL Server:

_select_sql = f"(SELECT MAX(id) FROM {tablename}) PUSH_DOWN_TABLE"

The somewhat esoteric "Incorrect syntax near the keyword 'WHERE'" exception message is the result of not including that name after the parenthesized query: Spark's schema-inference query wraps your table expression in something like SELECT * FROM (...) WHERE 1=0, which is also where the '0' in the second error message comes from.

This specific usage of the "table" option to execute a query is a way of performing a "push down" query. Using the "table" option in spark.read.jdbc executes the query as a kind of table in the source database and only returns the result of your aggregate function MAX. This reduces the amount of data that has to be transferred from the source database, because the aggregation is performed there and only the result is sent back.
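
Putting it together, a minimal sketch of the whole read, assuming url, tablename and properties are defined as in the question (PUSH_DOWN_TABLE is just an arbitrary alias for the derived table):

  # alias the derived table so SQL Server accepts Spark's wrapping query
  _select_sql = f"(SELECT MAX(id) FROM {tablename}) PUSH_DOWN_TABLE"
  highest_id_df = spark.read.jdbc(url, table=_select_sql, properties=properties)
  max_id = highest_id_df.first()[0]  # single row, single column: the MAX value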

Upvotes: 0

Hanebambel

Reputation: 121

Spark internally creates a SQL statement like

SELECT * FROM <table>

So, if you want to pass a statement as the table parameter, it has to look like this (just add a name to the inline view):

_select_sql = f"(SELECT MAX(id) FROM {tablename}) tab"

and Spark's generated SQL will look like this and do the job:

SELECT * FROM (SELECT MAX(id) FROM mytable) tab
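
If you use the option-based reader instead, the same aliased subquery can be passed through the dbtable option; a sketch, assuming jdbcUrl, tablename, user and password are defined elsewhere:

  # "tab" is the alias that Spark's generated SELECT * FROM (...) tab needs
  df = (spark.read.format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", f"(SELECT MAX(id) FROM {tablename}) tab")
        .option("user", user)
        .option("password", password)
        .load())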

Upvotes: 1

patryk_ostrowski

Reputation: 237

I solved this by using

_select_sql = f"(SELECT TOP(1) id FROM {tablename} ORDER BY id DESC)"

Thanks anyway to Alex Ott; it only works in the way he gave. I don't really know why, because some sources say this should work in a similar way:

df = spark.read.jdbc(url = url, table = query, properties = properties)

Upvotes: 0

Alex Ott

Reputation: 87079

You're using a query as a table name - this will not work. You need to use the query option instead (see the docs):

highest_id = spark.read.format("jdbc") \
  .option("url", jdbcUrl) \
  .option("query", f"SELECT MAX(id) FROM {tablename}") \
  .load() 
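
The result is a single-row, single-column DataFrame, so the value itself still has to be pulled back to the driver, for example:

  max_id = highest_id.first()[0]  # extract the MAX(id) scalar on the driver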

Upvotes: 1
