Reputation: 1
Is there any way to read data into a PySpark DataFrame from a SQL Server table based on a condition, e.g. read only rows where the column 'time_stamp' has the current date?
Alternatively, I want to translate:
select * from table_name where time_stamp=cast(getdate() as date)
into a PySpark DataFrame.
I am using:
remote_table = (spark.read.format("sqlserver")
    .option("host", "host_name")
    .option("user", "user_name")
    .option("password", "password")
    .option("database", "database_name")
    .option("dbtable", "dbo.table_name")
    .load())
which reads the entire table 'table_name'. I just need to read the rows that satisfy a condition, like a WHERE clause in SQL.
Upvotes: 0
Views: 901
Reputation: 1
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("DateRangeQuery").getOrCreate()
# Define your JDBC connection properties
connection_properties = {
    "user": "<username>",
    "password": "<password>",
    "url": "jdbc:sqlserver://<server>:<port>;databaseName=<database>"
}
# Define the date range
t_start = "2023-01-01"
t_end = "2023-12-31"
# Define your SQL query with the date range filter
sql_query = f"(SELECT * FROM table_name WHERE t_date BETWEEN '{t_start}' AND '{t_end}') AS custom_query"
# Read the filtered data using the SQL query
df = spark.read.jdbc(url=connection_properties["url"], table=sql_query, properties=connection_properties)
# Show the filtered DataFrame
df.show()
# Stop the Spark session
spark.stop()
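To match the original requirement (only rows where time_stamp equals today's date), the same subquery-pushdown trick can be reused. A minimal sketch, assuming the same placeholder connection details as above and that the table lives in the dbo schema:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CurrentDateQuery").getOrCreate()

# Same placeholder connection details as above -- replace with real values
url = "jdbc:sqlserver://<server>:<port>;databaseName=<database>"
connection_properties = {"user": "<username>", "password": "<password>"}

# Wrap the original WHERE clause in a subquery so the filter runs on
# SQL Server and only today's rows are transferred to Spark
sql_query = "(SELECT * FROM dbo.table_name WHERE time_stamp = CAST(GETDATE() AS date)) AS filtered"

df = spark.read.jdbc(url=url, table=sql_query, properties=connection_properties)
df.show()
Alternatively, a simple .filter() on the DataFrame read with the "sqlserver" format is usually pushed down to the source as well, but the subquery form makes the SQL sent to the server explicit.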
Upvotes: 0