dat_tester

Reputation: 11

How do I import a table DIRECTLY into a Python dataframe within Databricks?

I'm currently working in a dev environment in Databricks, using a notebook to apply some Python code to analyse some dummy data (just a few thousand rows) held in a database table. I then deploy this to the main environment and run it on the real data (hundreds of millions of rows).

To start with I just need values from a single column that meet a certain criterion. In order to get at the data I'm currently doing this:

  1. spk_data = spark.sql("SELECT field FROM database.table WHERE field == 'value'")
  2. data = spk_data.toPandas()

Then the rest of the Python notebook does its thing on that data, which works fine in the dev environment, but when I run it for real it falls over at line 2, saying it's out of memory.

I want to import the data DIRECTLY into the pandas dataframe and so remove the need to convert from Spark, as I'm assuming that will avoid the error. But after a LOT of Googling I still can't work out how; the only thing I've tried that appears syntactically valid is:

data = pd.read_table(r'database.table')

but just get:

'PermissionError: [Errno 13] Permission denied:'

(nb. unfortunately I have no control over the content, form or location of the database I'm querying)

Upvotes: 1

Views: 9038

Answers (2)

Daniel

Reputation: 1242

Your assumption is very likely to be untrue.

Spark is a distributed computation engine; pandas is a single-node toolset. So when you run a query on millions of rows, it's likely to fail. When you call df.toPandas(), Spark moves all of the data to your driver node, so if it's more than the driver's memory, it's going to fail with an out-of-memory exception. In other words, if your dataset is larger than memory, pandas is not going to work well.
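
To make that concrete, here is a minimal sketch using the asker's own query; the count() at the end merely stands in for whatever small, reduced result you actually need:

    # Runs distributed on the cluster: the filter executes on the workers
    spk_data = spark.sql("SELECT field FROM database.table WHERE field == 'value'")

    # Pulls EVERY row of the result onto the driver as one pandas DataFrame;
    # this is the step that fails with out-of-memory once the data is large
    data = spk_data.toPandas()

    # Safer pattern: reduce in Spark first and only bring back something
    # guaranteed to be small (here, a single number)
    n_matching = spk_data.count()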

Also, when using pandas on Databricks you are missing all of the benefits of the underlying cluster. You are just using the driver node.

There are two sensible options to solve this:

  • redo your solution using Spark
  • use Koalas, which has an API mostly compatible with pandas (see the sketch below)
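
As a sketch of the Koalas route, assuming the databricks.koalas package is available on the cluster (it ships with recent Databricks runtimes):

    import databricks.koalas as ks

    # A Koalas DataFrame exposes a pandas-like API but executes on Spark,
    # so no single node ever has to hold the full result
    kdf = ks.sql("SELECT field FROM database.table WHERE field == 'value'")
    print(kdf.head())  # familiar pandas-style calls, distributed execution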

Upvotes: 1

Harish Vutukuri

Reputation: 1150

pd.read_table expects a path to a delimited file on disk, not a database table name, hence the error. You have to use pd.read_sql_query for this case.
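
For illustration, a sketch of how pd.read_sql_query is typically used; the SQLAlchemy engine URL is a placeholder and process() is a hypothetical per-chunk handler. Note that this still funnels every row through a single Python process, so chunksize is what keeps memory bounded:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///example.db")  # placeholder connection

    # chunksize makes read_sql_query yield an iterator of smaller
    # DataFrames instead of materialising one giant frame in memory
    for chunk in pd.read_sql_query(
            "SELECT field FROM my_table WHERE field = 'value'",
            engine,
            chunksize=100_000):
        process(chunk)  # hypothetical handler for each piece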

Upvotes: 1
