Mel

Reputation: 750

Trying to convert an "org.apache.spark.sql.DataFrame" object to a pandas DataFrame results in the error "name 'dataframe' is not defined" in Databricks

I am trying to query a SQL database via a JDBC connection in Databricks and store the query results as a pandas DataFrame. All of the methods I can find online involve first storing the result as a Spark object using Scala code and then converting that to pandas. In cell 1 I tried:

%scala
val df_table1 = sqlContext.read.format("jdbc").options(Map(
    ("url" -> "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"),
    ("dbtable" -> "(select top 10 * from myschema.table) as table"),
    ("user" -> "user"),
    ("password" -> "password123"),
    ("driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"))
).load()

which results in:

df_table1: org.apache.spark.sql.DataFrame = [var1: int, var2: string ... 50 more fields]

Great! But when I try to convert it to a pandas df in cell 2 so I can use it:

import numpy as np
import pandas as pd 

result_pdf = df_table1.select("*").toPandas()

print(result_pdf)

It generates the error message:

NameError: name 'df_table1' is not defined

How do I successfully convert this object to a pandas DataFrame? Alternatively, is there any way of querying the SQL database over a JDBC connection using Python code, without needing Scala at all? (I do not particularly like Scala syntax and would rather avoid it if possible.)
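For reference, I understand that each %scala or %python cell runs in its own interpreter, so variables defined in one language are not visible from another. The workaround I have seen suggested (which I have not verified myself) is to share the data through the Spark session via a temporary view, e.g.:

%scala
// cell 1 (continued): expose the Scala DataFrame to other languages
df_table1.createOrReplaceTempView("df_table1_view")

%python
# cell 2: read the shared view back through the Spark session, then convert
result_pdf = spark.table("df_table1_view").toPandas()

But I was hoping to avoid the Scala step entirely.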

Upvotes: 1

Views: 611

Answers (1)

Himanshu Kumar Sinha

Reputation: 1786

I am assuming that your intention is to query SQL using Python, and if that's the case the code below will work.

%python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# In a Databricks notebook a SparkSession named `spark` already exists,
# so getOrCreate() simply returns the running context here
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)

# Connection settings for the Azure SQL database
database = "YourDBName"
table = "[dbo].[YourTableName]"
user = "SqlUser"
password = "SqlPassword"

# Read the first table over JDBC into a Spark DataFrame
DF1 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF1.show()

# Query a second table with the same connection settings
table = "[dbo].[someOthertable]"

DF2 = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://YourAzureSql.database.windows.net:1433;databaseName={database};") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
DF2.show()

# Join the two results on their key column and keep the columns of interest
Finaldf = DF1.join(DF2, (DF1.Prop_0 == DF2.prop_0), how="inner").select(DF1.Prop_0, DF1.Prop_1, DF2.Address)
Finaldf.show()
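Since the original goal was a pandas DataFrame, note that any of these Spark DataFrames can be converted directly from the same Python cell with toPandas(). This collects the full result to the driver, so it is only advisable for small result sets (such as the TOP 10 query in the question):

# Convert the joined Spark DataFrame to pandas; this pulls all rows
# to the driver, so keep the result set small
result_pdf = Finaldf.toPandas()
print(result_pdf)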

Upvotes: 1
